To Ira, Shobha, and Shampa


Preface 


Though there are many recent additions to graduate-level introductory books 
on Bayesian analysis, none has quite our blend of theory, methods, and ap- 
plications. We believe a beginning graduate student taking a Bayesian course 
or just trying to find out what it means to be a Bayesian ought to have some 
familiarity with all three aspects. More specialization can come later. 

Each of us has taught a course like this at Indian Statistical Institute or 
Purdue. In fact, at least partly, the book grew out of those courses. We would 
also like to refer to the review (Ghosh and Samanta (2002b)) that first made 
us think of writing a book. The book contains somewhat more material than 
can be covered in a single semester. We have done this intentionally, so that 
an instructor has some choice as to what to cover as well as which of the 
three aspects to emphasize. Such a choice is essential for the instructor. The 
topics include several results or methods that have not appeared in a graduate 
text before. In fact, the book can be used also as a second course in Bayesian 
analysis if the instructor supplies more details. 

Chapter 1 provides a quick review of classical statistical inference. Some 
knowledge of this is assumed when we compare different paradigms. Following 
this, an introduction to Bayesian inference is given in Chapter 2 emphasizing 
the need for the Bayesian approach to statistics. Objective priors and objec- 
tive Bayesian analysis are also introduced here. We use the terms objective 
and nonsubjective interchangeably. After briefly reviewing an axiomatic de- 
velopment of utility and prior, a detailed discussion on Bayesian robustness is 
provided in Chapter 3. Chapter 4 is mainly on convergence of posterior quan- 
tities and large sample approximations. In Chapter 5, we discuss Bayesian 
inference for problems with low-dimensional parameters, specifically objec- 
tive priors and objective Bayesian analysis for such problems. This covers 
a whole range of possibilities including uniform priors, Jeffreys’ prior, other 
invariant objective priors, and reference priors. After this, in Chapter 6 we 
discuss some aspects of testing and model selection, treating these two prob- 
lems as equivalent. This mostly involves Bayes factors and bounds on these 
computed over large classes of priors. Comparison with classical P-value is
also made whenever appropriate. Bayesian P-value and nonsubjective Bayes
factors such as the intrinsic and fractional Bayes factors are also introduced. 

Chapter 7 is on Bayesian computations. Analytic approximation and the 
E-M algorithm are covered here, but most of the emphasis is on Markov chain 
based Monte Carlo methods including the M-H algorithm and Gibbs sampler, 
which are currently the most popular techniques. Following this, in Chapter 8
we cover the Bayesian approach to some standard problems in statistics. The 
next chapter covers more complex problems, namely, hierarchical Bayesian 
(HB) point and interval estimation in high-dimensional problems and para- 
metric empirical Bayes (PEB) methods. Superiority of HB and PEB methods 
to classical methods and advantages of HB methods over PEB methods are 
discussed in detail. Akaike information criterion (AIC), Bayes information 
criterion (BIC), and other generalized Bayesian model selection criteria, high- 
dimensional testing problems, microarrays, and multiple comparisons are also 
covered here. The last chapter consists of three major methodological appli- 
cations along with the required methodology. 

We have marked those sections that are either very technical or are very 
specialized. These may be omitted at first reading, and also they need not be
part of a standard one-semester course. 

Several problems have been provided at the end of each chapter. More 
problems and other material will be placed at
http://www.isical.ac.in/~tapas/book

Many people have helped — our mentors, both friends and critics, from 
whom we have learnt, our family and students at ISI and Purdue, and the 
anonymous referees of the book. Special mention must be made of Arijit 
Chakrabarti for Sections 9.7 and 9.8, Sudipto Banerjee for Section 10.1, Partha 
P. Majumder for Appendix D, and Kajal Dihidar and Avranil Sarkar for help 
in several computations. We alone are responsible for our philosophical views, 
however tentatively held, as well as presentation. 

Thanks to John Kimmel, whose encouragement and support, as well as 
advice, were invaluable. 


Indian Statistical Institute and Purdue University Jayanta K. Ghosh 
Indian Statistical Institute Mohan Delampady 
Indian Statistical Institute Tapas Samanta 


February 2006


Statistical Preliminaries


We review briefly some of the background that is common to both classical 
statistics and Bayesian analysis. More details are available in Casella and 
Berger (1990), Lehmann and Casella (1998), and Bickel and Doksum (2001). 
The reader interested in Bayesian analysis can go directly to Chapter 2 after 
reading Section 1.1. 


1.1 Common Models 


A statistician, who has been given some data for analysis, begins by providing
a probabilistic model of the way his data have been generated. Usually the
data can be treated as generated by random sampling or some other random
mechanism. Once a model is chosen, the data are treated as a random vector
$X = (X_1, X_2, \ldots, X_n)$. The probability distribution of $X$ is specified by
$f(x|\theta)$, which stands for a joint density (or a probability mass function), and $\theta$
is an unknown constant or a vector of unknown constants called a parameter.
The parameter $\theta$ may be the unknown mean and variance of a population
from which $X$ is a random sample, e.g., the mean life of an electric bulb or
the probability of doing something, vide Examples 1.1, 1.2, and 1.3 below.
Often the data $X$ are collected to learn about $\theta$, i.e., the modeling precedes
collection of data. The set of possible values of $\theta$, called the parameter space,
is denoted by $\Theta$, which is usually a $p$-dimensional Euclidean space $\mathbb{R}^p$ or some
subset of it, $p$ being a positive integer. Our usual notation for data vector and
parameter vector are $\mathbf{X}$ and $\boldsymbol{\theta}$, respectively, but we may use $X$ and $\theta$ if there
is no fear of confusion.


Example 1.1. (normal distribution). $X_1, X_2, \ldots, X_n$ are heights of $n$ adults (all
males or all females) selected at random from some population. A common
model is that they are independently, normally distributed with mean $\mu$ and
variance $\sigma^2$, where $-\infty < \mu < \infty$ and $\sigma^2 > 0$, i.e., with $\theta = (\mu, \sigma^2)$,

$$f(x|\theta) = \prod_{i=1}^{n} \left[ \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2}(x_i - \mu)^2 \right\} \right].$$

We write this as $X_i$'s are i.i.d. (independently and identically distributed)
$N(\mu, \sigma^2)$.

If one samples both genders the model would be much more complicated:
$X_i$'s would be i.i.d. but the distribution of each $X_i$ would be a mixture of
two normals $N(\mu_F, \sigma_F^2)$ and $N(\mu_M, \sigma_M^2)$ where $F$ and $M$ refer to females and
males.


Example 1.2. (exponential distribution). Suppose a factory is producing some
electric bulbs or electronic components, say, switches. If the data are a random
sample of lifetimes of one kind of items being produced, we may model them
as i.i.d. with common density

$$f(x_i|\theta) = \frac{1}{\theta}\, e^{-x_i/\theta}, \quad x_i > 0, \ \theta > 0.$$


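Example 1.3. (Bernoulli, binomial distribution). Suppose we have $n$ students
in a class with

$$X_i = \begin{cases} 1 & \text{if $i$th student has passed a test;} \\ 0 & \text{otherwise.} \end{cases}$$

We model $X_i$'s as i.i.d. with the Bernoulli distribution:

$$f(x_i|\theta) = \begin{cases} \theta & \text{if } x_i = 1; \\ 1 - \theta & \text{if } x_i = 0, \end{cases}$$

which may be written more compactly as $\theta^{x_i}(1 - \theta)^{1 - x_i}$. The parameter $\theta$ is
the probability of passing. The joint probability function of $X_1, X_2, \ldots, X_n$ is

$$f(x|\theta) = \prod_{i=1}^{n} f(x_i|\theta) = \prod_{i=1}^{n} \left\{ \theta^{x_i}(1 - \theta)^{1 - x_i} \right\}, \quad \theta \in (0, 1).$$

If $Y = \sum_{i=1}^{n} X_i$, the number of students who pass, then
$P(Y = y) = \binom{n}{y}\theta^{y}(1 - \theta)^{n-y}$, which is a binomial distribution, denoted $B(n, \theta)$.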


Example 1.4. (binomial distribution with unknown $n$'s and unknown $p$). Suppose
$Y_1, Y_2, \ldots, Y_k$ are the number of reported burglaries in a place in $k$ years.
One may model $Y_i$'s as independent $B(n_i, p)$, where $n_i$ is the number of actual
burglaries (some reported, some not) in $i$th year and $p$ is the probability that
a burglary is reported. Here $\theta$ is $(n_1, \ldots, n_k, p)$.


Example 1.5. (Poisson distribution). Let $X_1, X_2, \ldots, X_n$ be the number of
accidents on a given street in $n$ years. $X_i$'s are modeled as i.i.d. $P(\lambda)$, i.e., Poisson
with mean $\lambda$,

$$P(X_i = x_i) = f(x_i|\lambda) = \exp(-\lambda)\,\frac{\lambda^{x_i}}{x_i!}, \quad x_i = 0, 1, 2, \ldots, \ \lambda > 0.$$


Example 1.6. (relation between binomial and Poisson). It is known $B(n, p)$
is well approximated by $P(\lambda)$ if $n$ is large, $p$ is small but $np = \lambda$ is nearly
constant, or, more precisely, $n \to \infty$, $p \to 0$ in such a way that $np \to \lambda$. This
is used in modeling distribution of defective items among some particular
products, e.g., bulbs or switches or clothes. Suppose a lot size $n$ is large.
Then the number of defective items, say $X$, is assumed to have a Poisson
distribution.
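
The approximation in Example 1.6 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the choice $\lambda = 2$, and the cut-off `x_max` are illustrative assumptions. It compares the $B(n, \lambda/n)$ and $P(\lambda)$ probabilities point by point as $n$ grows.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ P(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def max_abs_gap(n, lam, x_max=20):
    """Largest pointwise gap between B(n, lam/n) and P(lam) over x = 0..x_max."""
    p = lam / n
    return max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam))
               for x in range(min(n, x_max) + 1))

# Hypothetical setting: lambda = 2 held fixed while n increases.
gaps = {n: max_abs_gap(n, lam=2.0) for n in (10, 100, 1000)}
for n, gap in gaps.items():
    print(n, gap)
```

The printed gaps should shrink as $n$ increases with $np$ held fixed, which is the sense in which the Poisson model stands in for the binomial when the lot size is large.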


Closely related to the binomial are three other distributions, namely, geo- 
metric, negative binomial, which includes the geometric distribution, and the 
multinomial. All three, especially the last, are important.


Example 1.7. (geometric). Consider an experiment or trial with two possible
outcomes: success with probability $p$ and failure with probability $1 - p$. For
example, one may be trying to hit a bull's eye with a dart. Let $X$ be the
number of failures in a sequence of independent trials until the first success is
observed. Then

$$P\{X = x\} = p(1 - p)^{x}, \quad x = 0, 1, \ldots$$

This is a discrete analogue of the exponential distribution.


Example 1.8. (Negative binomial). In the same setup as above, let $k$ be given
and $X$ be the number of failures until $k$ successes are observed. Then

$$P\{X = x\} = \binom{x + k - 1}{k - 1} p^{k}(1 - p)^{x}, \quad x = 0, 1, \ldots$$

This is the negative binomial distribution. The geometric distribution is a
special case.
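
As a quick check on Examples 1.7 and 1.8, the sketch below (an illustration only; the function names and the value $p = 0.3$ are assumptions) confirms that the negative binomial with $k = 1$ coincides with the geometric distribution and that the probabilities add up to one.

```python
import math

def neg_binomial_pmf(x, k, p):
    """P(X = x): x failures before the k-th success, success probability p."""
    return math.comb(x + k - 1, k - 1) * p**k * (1 - p)**x

def geometric_pmf(x, p):
    """P(X = x): x failures before the first success."""
    return p * (1 - p)**x

p = 0.3  # hypothetical success probability

# k = 1 reduces the negative binomial to the geometric distribution.
same_as_geometric = all(
    math.isclose(neg_binomial_pmf(x, 1, p), geometric_pmf(x, p))
    for x in range(50)
)

# Summed over a long range of x, the probabilities are (almost) one.
total_k3 = sum(neg_binomial_pmf(x, 3, p) for x in range(400))
print(same_as_geometric, total_k3)
```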


Example 1.9. (multinomial). Suppose an urn has $N$ balls of $k$ colors, the number
of balls of $j$th color is $N_j = Np_j$ where $0 < p_j < 1$, $\sum_{j=1}^{k} p_j = 1$. We take
a random sample of $n$ balls, one by one and with replacement of the drawn
ball before the next draw. Let $X_i = j$ if the $i$th ball drawn is of $j$th color
and let $n_j$ = frequency of balls of the $j$th color in the sample. Then the joint
probability function of $X_1, X_2, \ldots, X_n$ is

$$f(x|p) = \prod_{j=1}^{k} p_j^{n_j},$$

and the joint probability function of $n_1, \ldots, n_k$ is

$$\frac{n!}{n_1! \cdots n_k!} \prod_{j=1}^{k} p_j^{n_j}.$$

The latter is called a multinomial distribution. We would also refer to the
joint distribution of $X$'s as multinomial.

Instead of considering specific models, we introduce now three families of
models that unify many theoretical discussions. In the following $X$ is a
$k$-dimensional random vector unless it is stated otherwise, and $f$ has the same
connotation as before.


1.1.1 Exponential Families 


Consider a family of probability models specified by $f(x|\theta)$, $\theta \in \Theta$. The family
is said to be an exponential family if $f(x|\theta)$ has the representation

$$f(x|\theta) = \exp\left\{ c(\theta) + \sum_{j=1}^{p} t_j(x) A_j(\theta) \right\} h(x), \tag{1.1}$$

where $c(\cdot)$, $A_j(\cdot)$ depend only on $\theta$ and $t_j(\cdot)$ depends only on $x$. Note that
the support of $f(x|\theta)$, namely, the set of $x$ where $f(x|\theta) > 0$, is the same as
the set where $h(x) > 0$ and hence does not depend on $\theta$. To avoid trivialities,
we assume that the support does not reduce to a single point.

Problem 1 invites you to verify that Examples 1.1 through 1.3 and Exam- 
ple 1.5 are exponential families. 

It is easy to verify that if $X_i$, $i = 1, \ldots, n$, are i.i.d. with density $f(x|\theta)$,
then their joint density is also exponential:

$$\prod_{i=1}^{n} f(x_i|\theta) = \exp\left\{ n c(\theta) + \sum_{j=1}^{p} T_j A_j(\theta) \right\} \prod_{i=1}^{n} h(x_i),$$

with $T_j = \sum_{i=1}^{n} t_j(x_i)$.
There are two convenient reparameterizations. Using new parameters we
may assume $A_j(\theta) = \theta_j$. Then

$$f(x|\theta) = \exp\left\{ c(\theta) + \sum_{j=1}^{p} t_j(x)\theta_j \right\} h(x). \tag{1.2}$$

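The general theory of exponential families, see, e.g., Brown (1986), ensures
one can interchange differentiation and integration. Differentiation once under
the integral sign leads to

$$0 = E_\theta\left( \frac{\partial}{\partial\theta_j} \log f(X|\theta) \right) = \frac{\partial c}{\partial\theta_j} + E_\theta\, t_j(X), \quad j = 1, \ldots, p. \tag{1.3}$$

In a similar way,

$$E_\theta\left( \frac{\partial^2 \log f}{\partial\theta_j\, \partial\theta_{j'}} \right) = -E_\theta\left( \frac{\partial \log f}{\partial\theta_j} \cdot \frac{\partial \log f}{\partial\theta_{j'}} \right). \tag{1.4}$$

In the second parameterization, we set $\eta_j = E_\theta(t_j(X))$, i.e.,

$$\eta_j = -\frac{\partial c}{\partial\theta_j}, \quad j = 1, \ldots, p. \tag{1.5}$$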


In Problem 3, you are asked to verify for $p = 1$ that $\eta$ is a one-one function
of $\theta$. A similar argument shows $\eta = (\eta_1, \ldots, \eta_p)$ is a one-one function of $\theta$.

The parameters $\theta$ are convenient mathematically, while the usual statistical
parameters are closer to $\eta$. You may wish to calculate $\eta$'s and verify this
for Examples 1.1 through 1.3 and Example 1.5.
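
For the Poisson model of Example 1.5 the calculation can be sketched numerically. The code below is an illustration under stated assumptions: the Poisson p.m.f. is written in the natural form $\exp\{c(\theta) + x\theta\}h(x)$ with $\theta = \log\lambda$, $c(\theta) = -e^{\theta}$, $t(x) = x$ and $h(x) = 1/x!$; the function names, the value $\lambda = 3.5$, and the truncation point `x_max` are hypothetical choices. It compares $E_\theta\, t(X)$, summed directly, with a finite-difference value of $-\partial c/\partial\theta$.

```python
import math

def c(theta):
    """c(theta) for the Poisson family in natural form, theta = log(lambda)."""
    return -math.exp(theta)

def mean_of_t(theta, x_max=200):
    """E_theta t(X) with t(x) = x, summed directly from the p.m.f."""
    total = 0.0
    for x in range(x_max + 1):
        log_pmf = c(theta) + x * theta - math.lgamma(x + 1)  # log h(x) = -log x!
        total += x * math.exp(log_pmf)
    return total

def minus_dc(theta, step=1e-6):
    """Central-difference approximation to -dc/dtheta."""
    return -(c(theta + step) - c(theta - step)) / (2 * step)

theta = math.log(3.5)           # corresponds to lambda = 3.5
eta_direct = mean_of_t(theta)   # E t(X)
eta_from_c = minus_dc(theta)    # -dc/dtheta
print(eta_direct, eta_from_c)
```

Both numbers should be close to $\lambda$, so here $\eta$ is the familiar Poisson mean while $\theta = \log\lambda$ is the mathematically convenient parameter.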


1.1.2 Location-Scale Families 


Definition 1.10. Let $X$ be a real-valued random variable, with density

$$f(x|\mu, \sigma) = \frac{1}{\sigma}\, g\left( \frac{x - \mu}{\sigma} \right),$$

where $g$ is also a density function, $-\infty < \mu < \infty$, $\sigma > 0$. The parameters $\mu$
and $\sigma$ are called location and scale parameters.


With $X$ as above, $Z = (X - \mu)/\sigma$ has density $g$. The normal $N(\mu, \sigma^2)$ is a
location-scale family with $Z$ being the standard normal, $N(0, 1)$. Example 1.2
is a scale family with $\mu = 0$, $\sigma = \theta$. We can make it a location-scale family if
we set

$$f(x|\mu, \sigma) = \begin{cases} \frac{1}{\sigma} \exp\left( -\frac{x - \mu}{\sigma} \right) & \text{for } x > \mu; \\ 0 & \text{otherwise,} \end{cases}$$

but then it ceases to be an exponential family for its range depends on $\mu$. The
other examples, namely, Bernoulli, binomial, and Poisson are not location-scale
families.


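Example 1.11. Let $X$ have uniform distribution over $(\theta_1, \theta_2)$ so that

$$f(x|\theta) = \begin{cases} \frac{1}{\theta_2 - \theta_1} & \text{if } \theta_1 < x < \theta_2; \\ 0 & \text{otherwise.} \end{cases}$$

This is also a location-scale family, with a reparameterization, which is not
an exponential family.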


Example 1.12. The Cauchy distribution specified by the density

$$f(x|\mu, \sigma) = \frac{1}{\pi}\, \frac{\sigma}{\sigma^2 + (x - \mu)^2}$$

is a location-scale family that is not exponential. It has several interesting
properties. As $|x| \to \infty$, it tends to zero but at a much slower rate than the
normal. One can verify that $E(|X|^r) = \infty$ for $r = 1, 2, \ldots$ under any $\mu$, $\sigma$. So
Cauchy has no finite moment. However, Figure 1.1 shows remarkable similarity
between the normal and Cauchy, except near the tails. The Cauchy density is
much flatter at the tails than the normal, which means $x$'s that deviate quite
a bit from $\mu$ will appear in data from time to time. Such deviations from $\mu$
would be unusual under a normal model and so may be treated as outliers by
a data analyst. The Cauchy provides an important counter-example to the law of large
numbers or central limit theorem when one has infinite moments. It also plays
an important role in robustness studies (see, e.g., Section 3.9).

Fig. 1.1. Densities of Cauchy(0, 1) and normal(0, 2.19).
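
The difference in tail behavior can be made concrete with exact tail probabilities. This is a small sketch, not taken from the text: for simplicity it compares Cauchy(0, 1) with the standard normal $N(0, 1)$ rather than the normal shown in Figure 1.1, and the function names and cut-offs are illustrative.

```python
import math

def cauchy_two_sided_tail(t):
    """P(|X| > t) for X ~ Cauchy(0, 1)."""
    return 1.0 - (2.0 / math.pi) * math.atan(t)

def normal_two_sided_tail(t):
    """P(|Z| > t) for Z ~ N(0, 1)."""
    return math.erfc(t / math.sqrt(2.0))

# Map each cut-off t to the pair (Cauchy tail, normal tail).
tails = {t: (cauchy_two_sided_tail(t), normal_two_sided_tail(t))
         for t in (1, 3, 5, 10)}
for t, (cauchy, normal) in tails.items():
    print(t, cauchy, normal)
```

Under the Cauchy model an observation more than five scale units from $\mu$ still has probability above one in ten, whereas the corresponding standard normal probability is below one in a million.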


Finally, many of the attractive statistical properties of the normal arise 
from the fact that it is both an exponential and a location-scale family, thereby 
inheriting interesting properties of both. 


1.1.3 Regular Family 


We end this section with a third very general family, defined by what are 
called mathematical regularity conditions. 


Definition 1.13. A family of densities $f(x|\theta)$ is said to satisfy Cramer-Rao
type regularity conditions if the support of $f(x|\theta)$, i.e., the set of $x$ for which
$f(x|\theta) > 0$, does not depend on $\theta$, $f$ is $k$ times continuously differentiable with
respect to $\theta$ (with $k$ usually equal to two or three) and one can differentiate
under the integral sign as indicated below for real-valued $\theta$:

$$E_\theta\left( \frac{\partial}{\partial\theta} \log f(X|\theta) \right) = \int_{-\infty}^{\infty} \left( \frac{\partial}{\partial\theta} \log f(x|\theta) \right) f(x|\theta)\, dx = \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} f(x|\theta)\, dx = \frac{d}{d\theta} \int_{-\infty}^{\infty} f(x|\theta)\, dx = 0, \tag{1.6}$$

and similarly,

$$E_\theta\left( \frac{\partial^2}{\partial\theta^2} \log f(X|\theta) \right) = -\int_{-\infty}^{\infty} \left( \frac{\partial}{\partial\theta} \log f(x|\theta) \right)^2 f(x|\theta)\, dx. \tag{1.7}$$

The condition that the support of $f(\cdot|\theta)$ is free of $\theta$ is required for the
last two relations to hold. The results of Chapter 4 require regularity conditions
of this kind. The exponential families satisfy these regularity conditions.
Location-scale families may or may not satisfy them; usually the critical assumption
is that relating to the support of $f$. Thus the Cauchy location-scale family
satisfies these conditions but not the uniform or the exponential density

$$f(x|\mu, \sigma) = \frac{1}{\sigma} \exp\left( -\frac{x - \mu}{\sigma} \right), \quad x > \mu.$$


1.2 Likelihood Function 


A concept of fundamental importance is the likelihood function. Informally,
for fixed $x$, the joint density or probability mass function (p.m.f.) $f(x|\theta)$,
regarded as a function of $\theta$, is called the likelihood function. When we think
of $f$ as the likelihood function we often suppress $x$ and write $f$ as $L(\theta)$. The
likelihood function is not unique in that for any $c(x) > 0$ that may depend on
$x$ but not on $\theta$, $c(x) f(x|\theta)$ is also a likelihood function. What is unique are
the likelihood ratios $L(\theta_2)/L(\theta_1)$, which indicate how plausible is $\theta_2$, relative
to $\theta_1$, in the light of the given data $x$. In particular, if the ratio is large, we
have a lot of confidence in $\theta_2$ relative to $\theta_1$ and the reverse situation holds if
the ratio is small. Of course the threshold for what is large or small isn't easy
to determine.

It is important to note that the likelihood is a point function. It can provide
information on relative plausibility of two points $\theta_1$ and $\theta_2$, but not of two
$\theta$-sets, say, two non-degenerate intervals.
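
A likelihood ratio is easy to compute for the Bernoulli model of Example 1.3. The sketch below uses made-up data (7 passes among 10 students) and hypothetical names; it also illustrates that multiplying $L$ by any $c(x) > 0$ leaves the ratio unchanged.

```python
def bernoulli_likelihood(theta, xs):
    """L(theta) = prod theta^x (1 - theta)^(1 - x) for 0/1 data xs."""
    value = 1.0
    for x in xs:
        value *= theta if x == 1 else 1.0 - theta
    return value

xs = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # hypothetical data: 7 passes out of 10

# How plausible is theta = 0.7 relative to theta = 0.3?
ratio = bernoulli_likelihood(0.7, xs) / bernoulli_likelihood(0.3, xs)

# Rescaling L by an arbitrary positive constant does not change the ratio.
c_x = 123.0
rescaled_ratio = (c_x * bernoulli_likelihood(0.7, xs)) / (c_x * bernoulli_likelihood(0.3, xs))
print(ratio, rescaled_ratio)
```

With these data the ratio equals $(7/3)^4$, roughly 30, so $\theta = 0.7$ is considerably more plausible than $\theta = 0.3$, although where to draw the line between large and small remains a matter of judgment.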

If the sample size $n$ is large, usually the likelihood function has a sharp
peak as shown in the following figure. Let the value of $\theta$ where the maximum is
attained be denoted as the maximum likelihood estimate (MLE) $\hat{\theta}$; we define
it formally later. In situations like this, one feels $\hat{\theta}$ is very plausible as an
estimate of $\theta$ relative to any other points outside a small interval around $\hat{\theta}$.
One would then expect $\hat{\theta}$ to be a good estimate of the unknown $\theta$, at least
in the sense of being close to it in some way (e.g., of being consistent, i.e.,
converging to $\theta$ in probability). We discuss these things more carefully below.

Fig. 1.2. $L(\theta)$ for the double exponential model when data is normal mixture.

Classical statistics also asserts that under regularity conditions and for large
$n$, the maximum likelihood estimate minimizes the variance approximately
within certain classes of estimates. Problem 10 provides a counter-example
due to Basu (1988) when regularity conditions do not hold.


Definition 1.14. The maximum likelihood estimate (MLE) $\hat{\theta}$ is a value of $\theta$
where the likelihood function $L(\theta) = f(x|\theta)$ attains its supremum, i.e.,

$$\sup_{\theta} f(x|\theta) = f(x|\hat{\theta}).$$

Usually, the MLE can be found by solving the likelihood equation

$$\frac{\partial}{\partial\theta_j} \log f(x|\theta) = 0, \quad j = 1, \ldots, p. \tag{1.8}$$
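
For the exponential model of Example 1.2 the likelihood equation can be solved by hand, giving the sample mean. The sketch below is only an illustration: the lifetimes are made up, and the function names and grid are hypothetical. It checks that the score vanishes at the sample mean and that no grid point has a higher log-likelihood.

```python
import math

def log_likelihood(theta, xs):
    """log L(theta) for i.i.d. data with density (1/theta) exp(-x/theta)."""
    return -len(xs) * math.log(theta) - sum(xs) / theta

def score(theta, xs):
    """Derivative of log L(theta); the likelihood equation sets this to zero."""
    return -len(xs) / theta + sum(xs) / theta**2

xs = [0.8, 2.1, 0.3, 1.7, 4.6, 1.1]   # hypothetical lifetimes
theta_hat = sum(xs) / len(xs)          # root of the likelihood equation

# Crude check on a grid of theta values from 0.05 to 20.
grid = [0.05 * k for k in range(1, 401)]
best_on_grid = max(grid, key=lambda th: log_likelihood(th, xs))
print(theta_hat, score(theta_hat, xs), best_on_grid)
```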

In Problem 4(b), you are asked to show the likelihood function is log-concave,
i.e., its logarithm is a concave function. In this case, if (1.8) has a
solution, it is unique and provides a global maximum. There are well-known
theorems, see, e.g., Rao (1973), which show the existence of a solution of (1.8)
which converges in probability to the unknown true $\theta$ if the dimension is fixed
and Cramer-Rao type regularity conditions hold. If (1.8) has multiple roots,
one has to be careful. A simple solution is to first find a $\sqrt{n}$-consistent estimate
$T_n$, i.e., an estimate $T_n$ such that $\sqrt{n}(T_n - \theta)$ is bounded in probability. Then
choose a solution that is nearest to $T_n$.


1.3 Sufficient Statistics and Ancillary Statistics


Given the importance of likelihood function, it is interesting and useful to
know what is the smallest set of statistics $T_1(x), \ldots, T_m(x)$ in terms of which
one can write down the likelihood function. As expected this makes it necessary
to introduce sufficient statistics.


Definition 1.15. Let $X$ be distributed with density $f(x|\theta)$. Then $T = T(X)$
$= (T_1(X), \ldots, T_m(X))$ is sufficient for $\theta$ if the conditional distribution of $X$
given $T$ is free of $\theta$.


A basic fact for verifying whether $T$ is sufficient is the following factorization
theorem: $T$ is sufficient for $\theta$ iff $f(x|\theta) = g(T_1(x), \ldots, T_m(x), \theta)\, h(x)$.

Using this, you are invited to prove (Problem 20) that the likelihood function
can be written in terms of $T$ iff $T$ is sufficient.
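
The factorization can be seen at work in the Bernoulli model of Example 1.3, where $T(x) = \sum x_i$, $g(t, \theta) = \theta^{t}(1 - \theta)^{n - t}$ and $h(x) = 1$. The following sketch (two made-up samples with the same total; all names are hypothetical) checks that the likelihood depends on the data only through $T$.

```python
import math

def bernoulli_likelihood(theta, xs):
    """Full likelihood, multiplying one factor per observation."""
    value = 1.0
    for x in xs:
        value *= theta if x == 1 else 1.0 - theta
    return value

def g(t, n, theta):
    """Factor that involves the data only through t = sum of the x's."""
    return theta**t * (1.0 - theta)**(n - t)

sample_a = [1, 1, 0, 0, 1]
sample_b = [0, 1, 1, 1, 0]   # a different sample with the same total

thetas = [0.1, 0.25, 0.5, 0.75, 0.9]
agree = all(
    math.isclose(bernoulli_likelihood(th, sample_a), bernoulli_likelihood(th, sample_b))
    and math.isclose(bernoulli_likelihood(th, sample_a),
                     g(sum(sample_a), len(sample_a), th))
    for th in thetas
)
print(agree)
```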

Thus the problem of finding the smallest T in terms of which one can 
write down the likelihood function reduces to the problem of finding what are 
called minimal sufficient statistics. 


Definition 1.16. A sufficient statistic $T_0$ is minimal sufficient (or smallest
among sufficient statistics) if $T_0$ is a function of every sufficient statistic.


Clearly, a one-one function of a minimal sufficient statistic is also mini- 
mal sufficient. In spite of the somewhat abstract definition, minimal sufficient 
statistics are usually easy to find by inspection. Most examples in this book 
would be covered by the following fact (Problem 19). 


Fact. Suppose $X_i$, $i = 1, 2, \ldots, n$ are i.i.d. from exponential family. Then
$(T_j = \sum_{i=1}^{n} t_j(X_i),\ j = 1, \ldots, p)$ together form a minimal sufficient statistic
and hence is the smallest set of statistics in terms of which we may write down
the likelihood function.


Using this, you can prove $(\sum_{i=1}^{n} X_i, \sum_{i=1}^{n} X_i^2)$ is minimal sufficient for $\mu$
and $\sigma^2$ if $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$. This in turn implies
$(\bar{X}, s^2 = \frac{1}{n-1} \sum (X_i - \bar{X})^2)$ is also minimal sufficient for $(\mu, \sigma^2)$, being a one-one
function of $(\sum_{i=1}^{n} X_i, \sum_{i=1}^{n} X_i^2)$. In the same way, $\bar{X}$ is minimal sufficient for both i.i.d.
$B(1, p)$ and $P(\lambda)$. In Problem 10, one has to show $X_{(1)} = \min(X_1, X_2, \ldots, X_n)$
and $X_{(n)} = \max(X_1, X_2, \ldots, X_n)$ are together minimal sufficient for $U(\theta, 2\theta)$.
A bad case is that of i.i.d. Cauchy$(\mu, \sigma^2)$. It is known (see, e.g., Lehmann
and Casella (1998)) that the minimal sufficient statistic is the set of all order
statistics $(X_{(1)}, X_{(2)}, \ldots, X_{(n)})$ where $X_{(1)}$ and $X_{(n)}$ have been defined earlier
and $X_{(r)}$ is the $r$th $X$ when the $X_i$'s are arranged in ascending order (assuming
all $X_i$'s are distinct). This is a bad case because the order statistics
together are always sufficient when $X_i$'s are i.i.d., and so if this is the minimal
sufficient statistic, it means the density is so complicated that the likelihood
cannot be expressed in terms of a smaller set of statistics. The advantage
of sufficiency is that we can replace the original data set $x$ by the minimal
sufficient statistic. Such reduction works well for i.i.d. random variables with
an exponential family of distributions or special examples like $U(\theta_1, \theta_2)$. It
doesn't work well in other cases including location-scale families.

There are various results in classical statistics that show a sufficient statistic
contains all the information about $\theta$ in the data $X$. At the other end is
a statistic whose distribution does not depend on $\theta$ and so contains no information
about $\theta$. Such a statistic is called ancillary.

Ancillary statistics are easy to exhibit if $X_1, \ldots, X_n$ are i.i.d. with a
location-scale family of densities. In fact, for any four integers $a$, $b$, $c$, and
$d$, the ratio

$$\frac{X_{(a)} - X_{(b)}}{X_{(c)} - X_{(d)}} = \frac{Z_{(a)} - Z_{(b)}}{Z_{(c)} - Z_{(d)}}$$

is ancillary because the right-hand side is expressed in terms of order statistics
of $Z_i$'s where $Z_i = (X_i - \mu)/\sigma$, $i = 1, \ldots, n$ are i.i.d. with a distribution free
of $\mu$ and $\sigma$.

There is an interesting technical theorem, due to Basu, which establishes 
independence of a sufficient statistic and an ancillary statistic. The result 
is useful in many calculations. Before we state Basu’s theorem, we need to 
introduce the notion of completeness. 


Definition 1.17. A statistic $T$ or its distribution is said to be complete if for
any real valued function $\psi(T)$,

$$E_\theta\, \psi(T(X)) = 0 \ \ \forall\, \theta \in \Theta \quad \text{implies} \quad \psi(T(X)) = 0$$

(with probability one under all $\theta$).


Suppose $T$ is discrete. The condition then simply means the family of
p.m.f.'s $f^T(t|\theta)$ of $T$ is rich enough that there is no non-zero $\psi(t)$ that is
orthogonal to $f^T(t|\theta)$ for all $\theta$ in the sense $\sum_t \psi(t) f^T(t|\theta) = 0$ for all $\theta$.


Theorem 1.18. (Basu). Suppose $T$ is a complete sufficient statistic and $U$
is any ancillary statistic. Then $T$ and $U$ are independent for all $\theta$.


Proof. Because $T$ is sufficient, the conditional probability of $U$ being in some
set $B$ given $T$ is free of $\theta$ and may be written as $P_\theta(U \in B\,|\,T) = \phi(T)$.
Since $U$ is ancillary, $E_\theta(\phi(T)) = P_\theta(U \in B) = c$, where $c$ is a constant.
Let $\psi(T) = \phi(T) - c$. Then $E_\theta \psi(T) = 0$ for all $\theta$, implying $\psi(T) = 0$ (with
probability one), i.e., $P_\theta(U \in B\,|\,T) = P_\theta(U \in B)$. $\Box$


It can be shown that a complete sufficient statistic is minimal sufficient.
In general, the converse isn't true. For exponential families, the minimal sufficient
statistic $(T_1, \ldots, T_p) = (\sum_{i=1}^{n} t_1(X_i), \ldots, \sum_{i=1}^{n} t_p(X_i))$ is complete. For
$X_1, X_2, \ldots, X_n$ i.i.d. $U(\theta_1, \theta_2)$, $(X_{(1)}, X_{(n)})$ is a complete sufficient statistic.
Here are a couple of applications of Basu's theorem.

Example 1.19. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$. Then $\bar{X}$ and $s^2 =
\frac{1}{n-1} \sum (X_i - \bar{X})^2$ are independent. To prove this, treat $\sigma^2$ as fixed to start
with and $\mu$ as the parameter. Then $\bar{X}$ is complete sufficient and $s^2$ is ancillary.
Hence $\bar{X}$ and $s^2$ are independent by Basu's theorem.


Example 1.20. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $U(\theta_1, \theta_2)$. Then for any $1 <
r < n$, $Y = (X_{(r)} - X_{(1)})/(X_{(n)} - X_{(1)})$ is independent of $(X_{(1)}, X_{(n)})$. This
follows because $Y$ is ancillary.
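
A small simulation gives some feel for Example 1.19. The sketch below is illustrative only: the sample size, number of replications, seed, and function names are arbitrary choices, and a near-zero correlation is merely consistent with independence, not a proof of it. For contrast it repeats the experiment with exponential data, where $\bar{X}$ and $s^2$ are not independent.

```python
import random

def sample_mean_and_s2(xs):
    """Return (x-bar, s^2), with s^2 using the divisor n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, s2

def correlation(us, vs):
    """Sample correlation coefficient of two equally long sequences."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    var_u = sum((u - mu) ** 2 for u in us)
    var_v = sum((v - mv) ** 2 for v in vs)
    return cov / (var_u * var_v) ** 0.5

def simulated_correlation(draw, n=5, reps=5000):
    """Correlation between x-bar and s^2 over repeated samples of size n."""
    pairs = [sample_mean_and_s2([draw() for _ in range(n)]) for _ in range(reps)]
    means, s2s = zip(*pairs)
    return correlation(means, s2s)

rng = random.Random(2006)   # fixed seed so the run is repeatable
corr_normal = simulated_correlation(lambda: rng.gauss(0.0, 1.0))
corr_exponential = simulated_correlation(lambda: rng.expovariate(1.0))
print(corr_normal, corr_exponential)
```

The normal case should give a correlation near zero, while the exponential case should give a clearly positive one.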


A somewhat different notion of sufficiency appears in Bayesian analysis. 
Its usefulness and relation to (classical) sufficiency is discussed in Appendix E. 


1.4 Three Basic Problems of Inference in Classical 
Statistics 


For simplicity, we take $p = 1$, so $\theta$ is a real-valued parameter. Informally,
inference is an attempt to learn about $\theta$. There are three natural things one
may wish to do. One may wish to estimate $\theta$ by a single number. A classical
estimate used in large samples is the MLE $\hat{\theta}$. Secondly, one may wish to
choose an interval that covers $\theta$ with high probability. Thirdly, one may test
hypotheses about $\theta$, e.g., test what is called a null hypothesis $H_0 : \theta = 0$
against a two-sided alternative $H_1 : \theta \ne 0$. More generally, one can test
$H_0 : \theta = \theta_0$ against $H_1 : \theta \ne \theta_0$ where $\theta_0$ is a value of some importance. For
example, $\theta$ is the effect of some new drug on one of the two blood pressures,
$\theta_0$ is the effect of an alternative drug in the market, and one is trying to
test whether the new drug has different effects. If one wants to test whether
the new drug is better, then instead of $H_1 : \theta \ne \theta_0$, one may like to consider
one-sided alternatives $H_1 : \theta < \theta_0$ or $H_1 : \theta > \theta_0$.


1.4.1 Point Estimates 


In principle, any statistic $T(X)$ is an estimate though the context usually
suggests some special reasonable candidates like sample mean $\bar{X}$ or sample
median for a population mean like $\mu$ of $N(\mu, \sigma^2)$. To choose a satisfactory or
optimal estimate one looks at the properties of its distribution. The two most
important quantities associated with a distribution are its mean and variance
or mean and the standard deviation, usually called the standard error of
the estimate. One would usually report a good estimate and estimate of the
standard error. So one judges an estimate $T$ by its mean $E(T|\theta)$ and variance
$\text{Var}(T|\theta)$. If we are trying to estimate $\theta$, we calculate the bias $E(T|\theta) - \theta$.
One prefers small absolute values of bias; one possibility is to consider only
unbiased estimates of $\theta$ and so one requires $E(T|\theta) = \theta$. Problem 17 requires
you to show both $\bar{X}$ and the sample median are unbiased estimates for $\mu$ in
$N(\mu, \sigma^2)$. If the object is to estimate some real-valued function $\tau(\theta)$ of $\theta$, one
would require $E(T|\theta) = \tau(\theta)$.

For unbiased estimates of $\tau$, $\text{Var}(T|\theta) = E\{(T - \tau(\theta))^2\,|\,\theta\}$ measures how
dispersed $T$ is around $\tau(\theta)$. The smaller the variance the better, so one may
search for an unbiased estimate that minimizes the variance. Because $\theta$ is not
known, one would have to try to minimize variance for all $\theta$. This is a very
strong condition but there is a good theory that applies to several classical
examples. In general, however, it would be unrealistic to expect that such an
optimal estimate exists. We will see the same difficulty in other problems
of classical inference. We now summarize the basic theory in a somewhat
informal manner.


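Theorem 1.21. Cramer-Rao Inequality (information inequality). Let
$T$ be an unbiased estimate of $\tau(\theta)$. Suppose we can interchange differentiation
and integration to get

$$\frac{d}{d\theta} E(T|\theta) = \int \cdots \int T(x)\, f'(x|\theta)\, dx,$$

and

$$0 = \frac{d}{d\theta} \left[ \int \cdots \int f(x|\theta)\, dx \right] = \int \cdots \int f'(x|\theta)\, dx.$$

Then,

$$\text{Var}(T|\theta) \ge \frac{[\tau'(\theta)]^2}{I_n(\theta)},$$

where the $'$ in $\tau$ and $f$ indicates a derivative with respect to $\theta$ and $I_n(\theta)$ is
Fisher information in $x$, namely,

$$I_n(\theta) = E\left\{ \left( \frac{\partial}{\partial\theta} \log f(X|\theta) \right)^2 \,\Big|\, \theta \right\}.$$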


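Proof. Let $\psi(X, \theta) = \frac{\partial}{\partial\theta} \log f(X|\theta)$. The second relation above implies
$E(\psi(X, \theta)\,|\,\theta) = 0$ and then, $\text{Var}(\psi(X, \theta)\,|\,\theta) = I_n(\theta)$. The first relation
implies

$$\text{Cov}(T, \psi(X, \theta)\,|\,\theta) = \tau'(\theta).$$

It then follows that

$$\text{Var}(T|\theta) \ge \frac{[\text{Cov}(T, \psi(X, \theta)\,|\,\theta)]^2}{\text{Var}(\psi(X, \theta)\,|\,\theta)} = \frac{[\tau'(\theta)]^2}{I_n(\theta)}.$$

If $X_1, \ldots, X_n$ are i.i.d. $f(x|\theta)$, then

$$I_n(\theta) = n I(\theta)$$

where $I(\theta)$ is the Fisher information in a single observation,

$$I(\theta) = E\left\{ \left( \frac{\partial}{\partial\theta} \log f(X_1|\theta) \right)^2 \,\Big|\, \theta \right\}.$$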


To get a feeling for $I_n(\theta)$, consider an extreme case where $f(x|\theta)$ is free
of $\theta$. Clearly, in this case there can be no information about $\theta$ in $X$. On the
other hand, if $I_n(\theta)$ is large, then on an average a small change in $\theta$ leads to a
big change in $\log f(x|\theta)$, i.e., $f$ depends strongly on $\theta$ and one expects there
is a lot that can be learned about $\theta$ and hence $\tau(\theta)$. A large value of $I_n(\theta)$
diminishes the lower bound making it plausible that one may be able to get
an unbiased estimate with small variance.

Finally, if the lower bound is attained at all $\theta$ by $T$, then clearly $T$ is
a uniformly minimum variance unbiased (UMVUE) estimate. We would call
them best unbiased estimates.
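
The Bernoulli model of Example 1.3 gives a case where the bound is attained. The sketch below is an illustration with hypothetical names and arbitrary values of $\theta$ and $n$: it computes $I(\theta)$ directly from the definition by summing over $x \in \{0, 1\}$ and compares the bound $1/(nI(\theta))$ for $\tau(\theta) = \theta$ with the variance of the sample mean.

```python
def fisher_information_bernoulli(theta):
    """I(theta) = E[(d/dtheta log f(X|theta))^2], summing over x in {0, 1}."""
    info = 0.0
    for x, prob in ((1, theta), (0, 1.0 - theta)):
        score = x / theta - (1 - x) / (1.0 - theta)   # derivative of log f(x|theta)
        info += prob * score**2
    return info

def cramer_rao_bound(theta, n):
    """Lower bound [tau'(theta)]^2 / (n I(theta)) for tau(theta) = theta."""
    return 1.0 / (n * fisher_information_bernoulli(theta))

def variance_of_sample_mean(theta, n):
    """Var(X-bar | theta) for n i.i.d. Bernoulli(theta) observations."""
    return theta * (1.0 - theta) / n

theta, n = 0.3, 25   # hypothetical values
bound = cramer_rao_bound(theta, n)
actual = variance_of_sample_mean(theta, n)
print(bound, actual)
```

The two numbers coincide, so in this model the sample mean attains the lower bound for estimating $\theta$.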

A more powerful method of getting best unbiased estimates is via the 
Rao-Blackwell theorem. 


Theorem 1.22. (Rao-Blackwell). If $T$ is an unbiased estimate of $\tau(\theta)$ and
$S$ is a sufficient statistic, then $T' = E(T|S)$ is also unbiased for $\tau(\theta)$ and

$$\text{Var}(T'|\theta) \le \text{Var}(T|\theta) \quad \forall\, \theta.$$


Corollary 1.23. If $S$ is complete and sufficient, then $T'$ as constructed above
is the best unbiased estimate for $\tau(\theta)$.


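Proof. By the property of conditional expectations,

$$E(T'|\theta) = E\{E(T|S)\,|\,\theta\} = E(T|\theta).$$

(You may want to verify this at least for the discrete case.) Also,

$$\text{Var}(T|\theta) = E\left[ \{(T - T') + (T' - \tau(\theta))\}^2 \,|\, \theta \right] = E\{(T - T')^2\,|\,\theta\} + E\{(T' - \tau(\theta))^2\,|\,\theta\},$$

because

$$\text{Cov}\{T - T', T' - \tau(\theta)\,|\,\theta\} = E\{(T - T')(T' - \tau(\theta))\,|\,\theta\} = E\left[ E\{(T' - \tau(\theta))(T - T')\,|\,S\} \,|\, \theta \right] = E\left[ (T' - \tau(\theta))\, E(T - T'\,|\,S) \,|\, \theta \right] = 0.$$

The decomposition of $\text{Var}(T|\theta)$ above shows that it is greater than or equal
to $\text{Var}(T'|\theta)$. $\Box$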
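
The construction $T' = E(T|S)$ can be carried out by brute force in the Bernoulli model of Example 1.3. In the sketch below (illustrative only; the names, $n = 4$, and the values of $\theta$ are arbitrary), $T = X_1$ is unbiased for $\theta$, $S = \sum X_i$ is sufficient, and the conditional expectation is found by listing all 0/1 sequences.

```python
from itertools import product

def conditional_mean_of_first(n, s, theta):
    """E(X_1 | S = s) for i.i.d. Bernoulli(theta), by enumerating 0/1 sequences."""
    numerator = denominator = 0.0
    for xs in product((0, 1), repeat=n):
        if sum(xs) != s:
            continue
        prob = theta**s * (1.0 - theta)**(n - s)   # probability of this sequence
        numerator += xs[0] * prob
        denominator += prob
    return numerator / denominator

n = 4
table = {
    theta: [conditional_mean_of_first(n, s, theta) for s in range(n + 1)]
    for theta in (0.2, 0.5, 0.9)
}
for theta, row in table.items():
    print(theta, row)
```

Every row equals $(0, 1/4, 2/4, 3/4, 1)$ whatever $\theta$ is, so $T' = S/n = \bar{X}$, whose variance $\theta(1 - \theta)/n$ is smaller than $\text{Var}(X_1|\theta) = \theta(1 - \theta)$ for $n > 1$.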


The theorem implies that in our search for the best unbiased estimate,
we may confine attention to unbiased estimates of $\tau(\theta)$ based on $S$. However,
under completeness, $T'$ is the only such estimate.

Example 1.24. Consider a random sample from $N(\mu, \sigma^2)$, $\sigma^2$ assumed known.
Note that by either of the two previous results, $\bar{X}$ is the best unbiased estimate
for $\mu$. The best unbiased estimate for $\mu^2$ is $\bar{X}^2 - \sigma^2/n$ by the Rao-Blackwell
theorem. You can show it does not attain the Cramer-Rao lower bound. If a
$T$ attains the Cramer-Rao lower bound, it has to be a linear function (with
$\theta = \mu$),

$$T(x) = a(\theta) + b(\theta) \frac{\partial}{\partial\theta} \log f(x|\theta),$$

i.e., must be of the form

$$T(x) = c(\theta) + d(\theta)\bar{x}.$$

But $T$ being a statistic, this means

$$T(x) = c + d\bar{x},$$

where $c$, $d$ are constants.


A similar argument holds for any exponential family. Conversely, suppose
a parametric model $f(x|\theta)$ allows a statistic $T$ to attain the Cramer-Rao lower
bound. Then,

$$T(x) = a(\theta) + b(\theta) \frac{\partial}{\partial\theta} \log f(x|\theta),$$

which implies

$$\frac{T(x) - a(\theta)}{b(\theta)} = \frac{\partial}{\partial\theta} \log f(x|\theta).$$

Integrating both sides with respect to $\theta$,

$$T(x) \int (b(\theta))^{-1}\, d\theta - \int a(\theta)(b(\theta))^{-1}\, d\theta + d(x) = \log f(x|\theta),$$

where $d(x)$ is the constant of integration. If we write $A(\theta) = \int (b(\theta))^{-1}\, d\theta$,
$c(\theta) = -\int a(\theta)(b(\theta))^{-1}\, d\theta$ and $d(x) = \log h(x)$, we get an exponential family.
The Cramer-Rao inequality remains important because it provides infor- 
mation about variance of T. Also, even if a best unbiased estimate can’t be 
found, one may be able to find an unbiased estimate with variance close to 
the lower bound. A fascinating recent application is Liu and Brown (1992). 
An unpleasant feature of the inequality as formulated above is that it
involves conditions on $T$ rather than only conditions on $f(x|\theta)$. A considerably
more technical version without this drawback may be found in Pitman (1979).
The theory for getting best unbiased estimates breaks down when there 
is no complete sufficient statistic. Except for the examples we have already 
seen, complete sufficient statistics rarely exist. Even when a complete sufficient 
statistic exists, one has to find an unbiased estimate based on the complete 
sufficient statistic $S$. This can be hard. Two heuristic methods work sometimes.
One is the method of indicator functions, illustrated in Problem 5.
The other is to start with a plausible estimate and then make a suitable
adjustment to make it unbiased. Thus to get an unbiased estimate for $\mu^2$ for
$N(\mu, \sigma^2)$, one would start with $\bar{X}^2$. We know for sure $\bar{X}^2$ can't be unbiased
since $E(\bar{X}^2 \mid \mu, \sigma^2) = \mu^2 + \sigma^2/n$. So if $\sigma^2$ is known, we can use $\bar{X}^2 - \sigma^2/n$. If
$\sigma^2$ is unknown, we can use $\bar{X}^2 - s^2/n$, where $s^2 = \sum (X_i - \bar{X})^2/(n - 1)$ is an
unbiased estimate of $\sigma^2$. Note that $\bar{X}^2 - s^2/n$ is a function of the complete,
sufficient statistic $(\bar{X}, s^2)$ but may be negative even though $\mu^2$ is a non-negative
quantity.
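The adjustment can be checked by simulation. The Python sketch below is an illustration added here, not the authors' code. It draws repeated $N(\mu, \sigma^2)$ samples and compares the average of $\bar{X}^2$ with the average of $\bar{X}^2 - s^2/n$, and also records how often the adjusted estimate is negative. The parameter values, the seed, and the function name are assumptions chosen so that negative values are common.

```python
import random
import statistics

def mu_squared_estimates(mu=0.2, sigma=1.0, n=10, reps=20000, seed=2):
    """Simulate Xbar^2 (biased) and Xbar^2 - s^2/n (unbiased) as estimates of mu^2."""
    rng = random.Random(seed)
    naive, adjusted = [], []
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(x)
        s2 = statistics.variance(x)           # divisor n - 1, unbiased for sigma^2
        naive.append(xbar ** 2)
        adjusted.append(xbar ** 2 - s2 / n)
    share_negative = sum(a < 0 for a in adjusted) / reps
    return statistics.fmean(naive), statistics.fmean(adjusted), share_negative

mean_naive, mean_adjusted, share_negative = mu_squared_estimates()
print(mean_naive, mean_adjusted, share_negative)
```

With these settings $\mu^2 = 0.04$ and $\mu^2 + \sigma^2/n = 0.14$, so the first average should be near 0.14 and the second near 0.04, with a sizeable share of negative adjusted estimates.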

For all these reasons, unbiasedness isn't as important in classical statistics
as it used to be. Exceptions are in unbiased estimation of risk (see Berger
and Robert (1990), Lu and Berger (1989a, b)) with various applications and
occasionally in variance estimation, especially in high-dimensional problems.
See Chapter 9 for an application.

We note finally that for relatively small values of p and relatively large 
values of n, it is easy to find estimates that are approximately unbiased and 
approximately attain the Cramer-Rao lower bound in a somewhat weak sense. 
An informal introduction to such results appears below. 

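Under regularity conditions, it can be shown that

$$\sqrt{n}\,(\hat{\theta} - \theta) - \frac{1}{\sqrt{n}\, I(\theta)} \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(X_i|\theta) \to 0 \quad \text{in probability}.$$

This implies $\hat{\theta}$ is approximately normal with mean $\theta$ and variance $(nI(\theta))^{-1}$,
which is the Cramer-Rao lower bound when we are estimating $\theta$. Thus $\hat{\theta}$ is
approximately normal with expectation equal to $\theta$ and variance equal to the
Cramer-Rao lower bound for $\tau(\theta) = \theta$. For a general differentiable $\tau(\theta)$, we
show $\tau(\hat{\theta})$ has similar properties. Observe that $\tau(\hat{\theta}) = \tau(\theta) + (\hat{\theta} - \theta)\tau'(\theta) +$
smaller terms, which exhibits $\tau(\hat{\theta})$ as an approximately linear function of $\hat{\theta}$.
Hence $\tau(\hat{\theta})$ is also approximately normal with

$$\text{mean} = \tau(\theta) + \text{(approximate) mean of } (\hat{\theta} - \theta)\,\tau'(\theta) = \tau(\theta), \text{ and}$$

$$\text{variance} = (\tau'(\theta))^2 \times \text{approximate variance of } (\hat{\theta} - \theta) = (\tau'(\theta))^2\, \frac{1}{nI(\theta)}.$$

The last expression is the Cramer-Rao lower bound for $\tau(\theta)$. The method of
approximating $\tau(\hat{\theta})$ by a linear function based on Taylor expansion is called
the delta method.

For $N(\mu, \sigma^2)$ and fixed $x$, let $\tau(\theta) = \tau(\theta, x) = P\{X \le x \mid \mu, \sigma\}$. An
approximately best unbiased estimate is $P\{X \le x \mid \hat{\mu}, \hat{\sigma}\} = \Phi\left(\frac{x - \bar{X}}{s'}\right)$, where
$s' = \sqrt{\frac{1}{n} \sum (X_i - \bar{X})^2}$ and $\Phi(\cdot)$ is the standard normal distribution function.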


The exact best unbiased estimate can be obtained by the method of indicator
functions. Let

$$I = \begin{cases} 1 & \text{if } X_1 \le x; \\ 0 & \text{otherwise.} \end{cases}$$

Then $I$ is an unbiased estimate of $\tau(\theta)$, so the best unbiased estimate is
$E(I \mid \bar{X}, s') = P\{X_1 \le x \mid \bar{X}, s'\}$. The explicit form is given in Problem 5.
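The delta-method variance $(\tau'(\theta))^2/(nI(\theta))$ discussed earlier in this subsection can be compared with a simulated variance. The Python sketch below is an added illustration under stated assumptions: data are $N(\mu, 1)$, so $I(\mu) = 1$, the estimate is $\hat{\mu} = \bar{X}$, and $\tau(\mu) = \mu^2$, giving a delta-method variance of $4\mu^2/n$. The numerical settings and the function name are hypothetical.

```python
import random
import statistics

def delta_vs_simulation(mu=2.0, n=100, reps=5000, seed=3):
    """Variance of tau(Xbar) = Xbar^2 for N(mu, 1) data: delta method vs simulation."""
    # For N(mu, 1), I(mu) = 1 and tau'(mu) = 2*mu, so the delta-method
    # variance is (tau'(mu))^2 / (n * I(mu)) = 4 * mu^2 / n.
    delta_var = (2.0 * mu) ** 2 / n
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        xbar = statistics.fmean([rng.gauss(mu, 1.0) for _ in range(n)])
        values.append(xbar ** 2)
    return delta_var, statistics.pvariance(values)

delta_var, simulated_var = delta_vs_simulation()
print(delta_var, simulated_var)
```

In this particular model the exact variance of $\bar{X}^2$ is $4\mu^2/n + 2/n^2$, so the two numbers should agree up to a small $O(n^{-2})$ term and simulation noise.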


1.4.2 Testing Hypotheses 


We consider only the case of real-valued $\theta$, the null hypothesis $H_0 : \theta = \theta_0$
and the two-sided alternative $H_1 : \theta \ne \theta_0$ or, one-sided null and one-sided
alternatives, e.g., $H_0 : \theta \le \theta_0$ and $H_1 : \theta > \theta_0$. In this formulation, the null
hypothesis represents status quo as in the drug example. It could also mean an
accepted scientific hypothesis, e.g., on the value of the gravitational constant
or velocity of light in some medium. This suggests that one should not reject
the null hypothesis unless there is compelling evidence in the data in favor of
$H_1$. This fact will be used below.

A test is a rule that tells us for each possible data set (under our model
$f(x|\theta)$) whether to accept or reject $H_0$. Let $W$ be the set of $x$'s for which
a given test rejects $H_0$ and $W^c$ be the set where the test accepts $H_0$. The
region $W$, called a critical region or rejection region, completely specifies the
test. Sometimes one works with the indicator of $W$ rather than $W$ itself. The
collection of all subsets $W$ in $R^n$ or their indicators corresponds to all possible
tests. How does one evaluate them in principle or choose one in some optimal
manner? The error committed by rejecting $H_0$ when $H_0$ is true is called the
error of first kind. Avoiding this is considered to be more important than the
so-called second kind of error committed when $H_0$ is accepted even though
$H_1$ is true. For any given $W$,


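$$\text{Probability of error of first kind} = P_{\theta_0}(X \in W) = E_{\theta_0}(I(X)),$$

where $I(x)$ is the indicator of $W$,

$$I(x) = \begin{cases} 1 & \text{if } x \in W; \\ 0 & \text{if } x \in W^c. \end{cases}$$

Probability of error of second kind $= P_\theta(X \in W^c) = 1 - E_\theta(I(X))$, for $\theta$
as in $H_1$. One also defines the power of detecting $H_1$ as $1 - P_\theta(X \in W^c) =
E_\theta(I(X))$ for $\theta$ as in $H_1$.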

It turns out that in general if one tries to reduce one error probability the 
other error probability goes up, so one cannot reduce both simultaneously. 
Because probability of error of first kind is more important, one first makes 
it small, 


$$E_{\theta_0}(I(X)) \le \alpha, \qquad (1.9)$$

where $\alpha$, conventionally .05, .01, etc., is taken to be a small number. Among
all tests satisfying this, one then tries to minimize the probability of committing
error of second kind or equivalently, to maximize the power uniformly for
all $\theta$ as in $H_1$. You can see the similarity of (1.9) with restriction to unbiased
estimates and the optimization problem subject to (1.9) as the problem of
minimizing variance among unbiased estimates. The best test is called uniformly
most powerful (UMP) after Neyman and Pearson who developed this
theory.

It turns out that for exponential families and some special cases like $U(0, \theta)$
or $U(\theta - \frac{1}{2}, \theta + \frac{1}{2})$, one can find UMP tests for one-sided alternatives. The
basic tool is the following result about a simple alternative $H_1 : \theta = \theta_1$.


Lemma 1.25. (Neyman-Pearson). Consider $H_0 : \theta = \theta_0$ versus $H_1 : \theta =
\theta_1$. Fix $0 < \alpha < 1$.

A. Suppose there exists a non-negative $k$ and a test given by the indicator
function $I_0$ such that

$$I_0(x) = \begin{cases} 1 & \text{if } f(x|\theta_1) > k f(x|\theta_0); \\ 0 & \text{if } f(x|\theta_1) < k f(x|\theta_0), \end{cases}$$

(with no restriction on $I_0$ if $f(x|\theta_1) = k f(x|\theta_0)$), such that $E_{\theta_0}(I_0(X)) = \alpha$.
Then
$$E_{\theta_1}(I_0(X)) \ge E_{\theta_1}(I_1(X))$$
for all indicators $I_1$ satisfying
$$E_{\theta_0}(I_1(X)) \le \alpha,$$
i.e., the test given by $I_0$ is MP among all tests satisfying the previous inequality.

B. Suppose $g$ is a given integrable function and we want all tests to satisfy

$$E_{\theta_0}(I(X)) = \alpha \quad \text{and} \quad \int I(x) g(x)\, dx = c \ \text{(same for all } I\text{)}. \qquad (1.10)$$

Then among all such $I$, $E_{\theta_1}(I(X))$ is maximum at

$$I_0(x) = \begin{cases} 1 & \text{if } f(x|\theta_1) > k_1 f(x|\theta_0) + k_2 g(x); \\ 0 & \text{if } f(x|\theta_1) < k_1 f(x|\theta_0) + k_2 g(x), \end{cases}$$

where $k_1$ and $k_2$ are two constants such that $I_0$ satisfies the two constraints
given in (1.10).

C. If $I_0$ exists in A or B and $I_1$ is an indicator having the same maximizing
property as $I_0$ under the same constraints, then $I_0(x)$ and $I_1(x)$ are the same
if $f(x|\theta_1) - k f(x|\theta_0) \ne 0$, in case of A, and $f(x|\theta_1) - k_1 f(x|\theta_0) - k_2 g(x) \ne 0$,
in case of B.


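Proof. A. By definition of $I_0$ and the fact that $0 \le I_1(x) \le 1$ for all $I_1$, we
have that

$$\int_{\mathcal{X}} \{(I_0(x) - I_1(x))(f(x|\theta_1) - k f(x|\theta_0))\}\, dx \ge 0, \qquad (1.11)$$

which implies

$$\int_{\mathcal{X}} \{I_0(x) - I_1(x)\} f(x|\theta_1)\, dx \ge k \int_{\mathcal{X}} I_0(x) f(x|\theta_0)\, dx - k \int_{\mathcal{X}} I_1(x) f(x|\theta_0)\, dx \ge k\alpha - k\alpha = 0.$$

B. The proof is similar to that of A except that one starts with

$$\int_{\mathcal{X}} (I_0 - I)\{f(x|\theta_1) - k_1 f(x|\theta_0) - k_2 g(x)\}\, dx \ge 0.$$

C. Suppose $I_0$ is as in A and $I_1$ maximizes $\int_{\mathcal{X}} I f(x|\theta_1)\, dx$, i.e.,

$$\int_{\mathcal{X}} I_0 f(x|\theta_1)\, dx = \int_{\mathcal{X}} I_1 f(x|\theta_1)\, dx$$

subject to

$$\int_{\mathcal{X}} I_0 f(x|\theta_0)\, dx = \alpha, \quad \text{and} \quad \int_{\mathcal{X}} I_1 f(x|\theta_0)\, dx = \alpha.$$

Then,
$$\int_{\mathcal{X}} \{I_0 - I_1\}\{f(x|\theta_1) - k f(x|\theta_0)\}\, dx = 0.$$

But the integrand $\{I_0(x) - I_1(x)\}\{f(x|\theta_1) - k f(x|\theta_0)\}$ is non-negative for
all $x$. Hence
$$I_0(x) = I_1(x) \quad \text{if } f(x|\theta_1) - k f(x|\theta_0) \ne 0.$$

This completes the proof. □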


Remark 1.26. Part A is called the sufficiency part of the lemma. Part B is a
generalization of A. Part C is a kind of necessary condition for $I_1$ to be MP
provided $I_0$ as specified in A or B exists.


If $X_i$'s are i.i.d. $N(\mu, \sigma^2)$, then $\{x : f(x|\theta_1) = k f(x|\theta_0)\}$ has probability
zero. This is usually the case for continuous random variables. Then the MP
test, if it exists, is unique. It fails for some continuous random variables like
$X_i$'s that are i.i.d. $U(0, \theta)$ and for discrete random variables. In such cases the
MP test need not be unique.

Using A of the lemma we show that for $N(\mu, \sigma^2)$, $\sigma^2$ known, the UMP
test of $H_0 : \mu = \mu_0$ for a one-sided alternative, say, $H_1 : \mu > \mu_0$ is given by

$$I_0 = \begin{cases} 1 & \text{if } \bar{x} > \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}; \\ 0 & \text{if } \bar{x} \le \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}, \end{cases}$$

where $z_\alpha$ is such that $P\{Z > z_\alpha\} = \alpha$ with $Z \sim N(0, 1)$.
Fix $\mu_1 > \mu_0$. Note that $f(x|\mu_1)/f(x|\mu_0)$ is an increasing function of $\bar{x}$.
Hence for any $k$ in A, there is a constant $c$ such that

$$f(x|\mu_1) > k f(x|\mu_0) \quad \text{if and only if} \quad \bar{x} > c.$$

So the MP test is given by the indicator

$$I_0 = \begin{cases} 1 & \text{if } \bar{x} > c; \\ 0 & \text{if } \bar{x} \le c, \end{cases}$$

where $c$ is such that $E_{\mu_0}(I_0) = \alpha$. It is easy to verify $c = \mu_0 + z_\alpha \sigma/\sqrt{n}$ does
have this property. Because this test does not depend on the value of $\mu_1$, it is
MP for all $\mu_1 > \mu_0$ and hence it is UMP for $H_1 : \mu > \mu_0$.
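The cut-off $c = \mu_0 + z_\alpha \sigma/\sqrt{n}$ and the resulting power function are easy to compute numerically. The following Python sketch is an added illustration, not taken from the text; the function names and the numerical settings ($\mu_0 = 0$, $\sigma = 1$, $n = 25$, $\alpha = .05$) are assumptions.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # Z ~ N(0, 1)

def ump_critical_value(mu0, sigma, n, alpha):
    """Cut-off c = mu0 + z_alpha * sigma / sqrt(n); reject H0 when xbar > c."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)   # P(Z > z_alpha) = alpha
    return mu0 + z_alpha * sigma / sqrt(n)

def ump_power(mu, mu0, sigma, n, alpha):
    """P(Xbar > c) when the true mean is mu, i.e. the power function of the test."""
    c = ump_critical_value(mu0, sigma, n, alpha)
    return 1.0 - NormalDist(mu, sigma / sqrt(n)).cdf(c)

c = ump_critical_value(mu0=0.0, sigma=1.0, n=25, alpha=0.05)
print(round(c, 3))
for mu in (-0.5, 0.0, 0.25, 0.5):
    print(mu, round(ump_power(mu, 0.0, 1.0, 25, 0.05), 4))
```

The power equals $\alpha$ at $\mu = \mu_0$, increases with $\mu$, and is below $\alpha$ for $\mu < \mu_0$, which is the behavior used in the discussion of one-sided nulls.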

In the same way, one can find the UMP test for $H_1 : \mu < \mu_0$ and verify that
the test now rejects $H_0$ if $\bar{x} < \mu_0 - z_\alpha \sigma/\sqrt{n}$. How about $H_0 : \mu \le \mu_0$ versus
$H_1 : \mu > \mu_0$? Here we consider all tests with the property

$$P_{\mu, \sigma^2}(H_0 \text{ is rejected}) \le \alpha \quad \text{for all } \mu \le \mu_0.$$

Using Problem 6 (or 7), it is easy to verify that the UMP test of $H_0 : \mu = \mu_0$
versus $H_1 : \mu > \mu_0$ is also UMP when the null is changed to $H_0 : \mu \le \mu_0$.

One consequence of these calculations and the uniqueness of MP tests
(Part C) is that there is no UMP test against two-sided alternatives. Each
of the two UMP tests does well for its $H_1$ but very badly at other $\theta_1$, e.g.,
the UMP test $I_0$ for $H_0 : \mu = \mu_0$ versus $H_1 : \mu > \mu_0$ obtained above has
$E_{\mu_1}(I_0) \to 0$ as $\mu_1 \to -\infty$. To avoid such poor behavior at some $\theta$'s, one
may require that the power cannot be smaller than $\alpha$. Then $E_{\theta_0}(I) \le \alpha$, and
$E_\theta(I) \ge \alpha$, $\theta \ne \theta_0$, imply $E_{\theta_0}(I) = \alpha$ and $E_\theta(I)$ has a global and hence a
local minimum at $\theta = \theta_0$. Tests of this kind were first considered by Neyman
and Pearson who called them unbiased. There is a similarity with unbiased
estimates that was later pointed out by Lehmann (1986) (see Chapter 1 there).
Because every unbiased $I$ satisfies conditions of Part B with $g = f'(x|\theta_0)$, one
can show that the MP test for any $\theta_1 \ne \theta_0$ satisfies conditions for $I_0$. With a
little more effort, it can be shown that the MP test is in fact

$$I_0 = \begin{cases} 1 & \text{if } \bar{x} > c_2 \text{ or } \bar{x} < c_1; \\ 0 & \text{if } c_1 \le \bar{x} \le c_2, \end{cases}$$

for suitable $c_1$ and $c_2$. The given constraints can be satisfied if

$$c_1 = \mu_0 - z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, \qquad c_2 = \mu_0 + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}.$$

This is the UMP unbiased test.

We have so far discussed how to control $\alpha$, the probability of error of
first kind and then, subject to this and other constraints, minimize $\beta(\theta)$, the
probability of error of second kind. But how do we bring $\beta(\theta)$ to a level that
is desired? This is usually done by choosing an appropriate sample size $n$, see
Problem 8.
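A sample-size calculation of this kind can be sketched in a few lines. The Python code below is an added illustration, assuming $N(\mu, 1)$ data and the one-sided UMP test, and using the approximation $n \approx ((z_\alpha + z_\beta)/\Delta)^2$ quoted in Problem 8, rounded up. The function names and the values $\alpha = .05$, $\beta = .10$, $\Delta = .5$ are assumptions and differ from those in the exercise.

```python
from math import ceil, sqrt
from statistics import NormalDist

def smallest_n(alpha, beta, delta):
    """Smallest n with type II error <= beta at mu0 + delta (N(mu, 1) data, one-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha)
    z_beta = z.inv_cdf(1.0 - beta)
    return ceil(((z_alpha + z_beta) / delta) ** 2)

def type_two_error(n, alpha, delta):
    """beta(mu0 + delta) = Phi(z_alpha - sqrt(n) * delta) for the level-alpha UMP test."""
    z = NormalDist()
    return z.cdf(z.inv_cdf(1.0 - alpha) - sqrt(n) * delta)

n = smallest_n(alpha=0.05, beta=0.10, delta=0.5)
print(n, type_two_error(n, 0.05, 0.5), type_two_error(n - 1, 0.05, 0.5))
```

The second function lets one confirm that the error of second kind is at most $\beta$ at the returned $n$ but exceeds $\beta$ at $n - 1$.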

The general theory for exponential families is similar with $T = \sum_{i=1}^{n} t(x_i)$
or $T/n$ taking on the role of $\bar{x}$. However, the distribution of $T$ may be discrete,
as in the case of binomial or Poisson. Then it may not be possible to find the
constants $c$ or $c_1$, $c_2$. Try, for example, the case of $B(5, p)$, $H_0 : p = \frac{1}{2}$ versus
$H_1 : p > \frac{1}{2}$, $\alpha = .05$. In practice one chooses an $\alpha' < \alpha$ and as close to $\alpha$ as
possible for which the constants can be found and the lemma applied with $\alpha'$
instead of $\alpha$.
A different option of some theoretical interest only is to extend the class
of tests to what are called randomized tests. A randomized test is given by a
function $0 \le \phi(x) \le 1$, which we interpret as the probability of rejecting $H_0$
given $x$. By setting $\phi$ equal to an indicator we get back the non-randomized
tests. With this extension, one can find a UMP test for binomial or Poisson
of the form

$$\phi_0 = \begin{cases} 1 & \text{if } T > c; \\ 0 & \text{if } T < c; \\ \gamma & \text{if } T = c, \end{cases}$$

where $0 < \gamma < 1$ is chosen along with $c$ so that $E_{\theta_0}(\phi_0) = \alpha$. Such use
of randomization has some other theoretical advantages. Randomization is
sometimes needed to get a minimax test (i.e., a test that minimizes maximum
probability of error of either kind), vide Problem 14. Most important of all,
randomization leads to the convexity of the collection of all tests in the sense
that if $\phi_1(x)$ and $\phi_2(x)$ are two randomized or non-randomized tests, the
convex combination $\lambda\phi_1 + (1 - \lambda)\phi_2$, $0 \le \lambda \le 1$, is again a function $\phi(x)$ lying
between 0 and 1 and so it is a randomized test. This leads to convexity of risk
set (Problem 15).
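The constants of a randomized test can be found by a short search over the upper tail of $T$. The Python sketch below is an added illustration with assumed inputs: $T \sim B(5, p)$, null value $p = 1/2$, and $\alpha = 1/20$. Exact rational arithmetic is used so that the size condition $P(T > c) + \gamma P(T = c) = \alpha$ holds exactly; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def randomized_test(n, p0, alpha):
    """Find (c, gamma) with P(T > c) + gamma * P(T = c) = alpha for T ~ B(n, p0)."""
    pmf = [comb(n, t) * p0 ** t * (1 - p0) ** (n - t) for t in range(n + 1)]
    upper_tail = Fraction(0)            # running value of P(T > c)
    for c in range(n, -1, -1):
        if upper_tail + pmf[c] > alpha:
            gamma = (alpha - upper_tail) / pmf[c]
            return c, gamma
        upper_tail += pmf[c]
    return -1, Fraction(0)              # only reached when alpha >= 1

c, gamma = randomized_test(n=5, p0=Fraction(1, 2), alpha=Fraction(1, 20))
print(c, gamma)   # 4 3/25
```

Here $P(T = 5) = 1/32 < .05 < P(T \ge 4) = 6/32$, so no non-randomized test has size exactly .05; the randomized test rejects when $T = 5$ and rejects with probability $3/25$ when $T = 4$.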

Except for exponential families and a few special examples, UMP tests
don't exist. However, just as in the case of estimation theory, there are approximately
optimum tests based directly on maximum likelihood estimates
of $\theta$ or the likelihood ratio statistic

$$\frac{f(x|\theta_0)}{\sup_{\theta \in \Theta_1} f(x|\theta)},$$

where $\Theta_1$ is the set specified by $H_1$.


1.4.3 Interval Estimation 


A commonly used so-called confidence interval for $\mu$ in $N(\mu, \sigma^2)$ with $\sigma^2$
known is $\bar{X} \pm z_{\alpha/2}\sigma/\sqrt{n}$. This means

$$P_{\mu, \sigma^2}\{\mu \in \text{confidence interval}\} = P_{\mu, \sigma^2}\left\{\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right\} = 1 - \alpha.$$

In this statement, as in all other areas of classical statistics, $\mu$ is a constant;
the probability statement is about $\bar{X}$. So $(1 - \alpha)$ is the proportion of times the
interval covers $\mu$ over repetitions of the experiment and data sets. If one has a
data set with $\bar{X} = 3$, and asks for the probability that $\mu$ lies in $3 \pm z_{\alpha/2}\sigma/\sqrt{n}$,
the answer isn't $1 - \alpha$ but trivially zero or one depending on the value of $\mu$.
Though the idea of such intervals is quite old, it was Neyman who formalized
them.
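The repeated-sampling meaning of $1 - \alpha$ can be illustrated by simulation. The following Python sketch is an addition to the text under assumed settings ($\mu = 3$, $\sigma = 2$, $n = 20$, $\alpha = .05$): it generates many data sets from a fixed $N(\mu, \sigma^2)$ and counts how often the interval $\bar{X} \pm z_{\alpha/2}\sigma/\sqrt{n}$ covers $\mu$. The function name and the seed are hypothetical.

```python
import random
import statistics
from math import sqrt
from statistics import NormalDist

def coverage(mu=3.0, sigma=2.0, n=20, alpha=0.05, reps=10000, seed=4):
    """Share of simulated data sets whose interval Xbar +/- z_{alpha/2} sigma/sqrt(n) covers mu."""
    rng = random.Random(seed)
    half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        hits += (xbar - half_width <= mu <= xbar + half_width)
    return hits / reps

observed_coverage = coverage()
print(observed_coverage)
```

The observed share should be close to .95. For any single simulated data set, however, the interval either covers $\mu$ or it does not, which is the point made above.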


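For $X \sim f(x|\theta)$, $\theta \in R$, one calls $(\underline{\theta}(X), \bar{\theta}(X))$ a confidence interval with
confidence coefficient $1 - \alpha$, if $P_\theta\{\underline{\theta}(X) \le \theta \le \bar{\theta}(X)\} = 1 - \alpha$.


The simplest way to generate them is to find what Fisher called a pivotal
quantity, namely, a real valued function $T(X, \theta)$ of both $X$ and $\theta$ such that
the distribution of $T(X, \theta)$ does not depend on $\theta$. Suppose then we choose two
numbers $t_1$ and $t_2$ such that $P_\theta\{t_1 \le T(X, \theta) \le t_2\} = 1 - \alpha$. If for each $X$,
$T(X, \theta)$ is monotone in $\theta$, say, an increasing function of $\theta$, then we can find
$\underline{\theta}(X)$ and $\bar{\theta}(X)$ such that $T(X, \bar{\theta}(X)) = t_2$ and $T(X, \underline{\theta}(X)) = t_1$. Clearly
$\underline{\theta} \le \theta \le \bar{\theta}$ iff $t_1 \le T \le t_2$ and hence $\underline{\theta} \le \theta \le \bar{\theta}$ with probability $1 - \alpha$.

In the normal example, $T(X, \mu) = \bar{X} - \mu$, the distribution of which is
$N(0, \sigma^2/n)$.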

Neyman showed one can also derive confidence intervals from tests. We 
illustrate this with the normal. For each $\mu_0$, consider the UMPU test

$$I_0 = \begin{cases} 0 & \text{if } \mu_0 - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu_0 + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}; \\ 1 & \text{otherwise.} \end{cases}$$

(We have taken $I_0 = 0$ at the two boundaries, which have zero probability
anyway.)

We now define a confidence set, say, $A(X) \subset R$ by

$$A(X) = \{\mu_0 \text{ such that } H_0 : \mu = \mu_0 \text{ is accepted by its UMPU test}\}.$$

Then $P_{\mu_0}\{A(X) \text{ covers } \mu_0\} = P_{\mu_0}\{H_0 \text{ is accepted by its UMPU test}\} = 1 - \alpha$.

Also $A(X)$ is nothing but the interval $\bar{X} \pm z_{\alpha/2}\sigma/\sqrt{n}$. We have just gotten
the same interval by a different route.

This approach helps in showing many common intervals have the prop- 
erty of being shortest, i.e., having smallest expected length of all confidence 
intervals obtainable from a family of unbiased tests. This follows from an ap- 
plication of a simple but somewhat technical result (vide Ghosh-Pratt identity 
in Encyclopedia of Statistics). 


1.5 Inference as a Statistical Decision Problem 


The three apparently very different inference problems discussed in Section 1.4 
can be unified by formulating them as statistical decision problems. This ap- 
proach is due to Wald, who not only unified classical inference but proved 
basic theorems applying to all inference. A couple of his results are mentioned 
below and in Section 2.3. One gains a certain conceptual clarity as well as a
certain broader outlook. However, certain special features of each problem, either
relating to historical context or relating to such considerations as intuitive
appeal or reasonableness, are lost.

A statistical decision problem has a model $f(x|\theta)$ and a space $\mathcal{A}$ of actions
or decisions "$a$". A decision rule or decision function is a function $\delta(x)$ from
the sample space of the data to the action space $\mathcal{A}$, i.e., $\delta(x)$ is an action, for
each $x$. To implement this rule, one simply takes the action $\delta(x)$ if data are
$x$.

In estimation $\mathcal{A} = R$ and $\delta(x) = T(x) \in R$ is nothing but an estimate
of $\theta$. In testing, the action space consists of two elements {"accept $H_0$",
"accept $H_1$"}. We may denote these elements as $a_0$ and $a_1$. A decision function
has a one-one correspondence with an indicator function as follows:

$$I(x) = \begin{cases} 1 & \text{iff } \delta(x) = a_1; \\ 0 & \text{iff } \delta(x) = a_0. \end{cases}$$

In interval estimation, the action space would be the collection of all intervals
$[a, b]$. Each confidence interval is a decision function.

One of the advantages of the new approach is that it liberated classi- 
cal statistics from some historical legacies like unbiasedness and in this way 
broadened it. We will discuss this particular point again in the chapter on 
hierarchical Bayes analysis. 

One more concept is needed to evaluate the performance of a decision
function. Let the loss $L(\theta, a)$ be a measure of how good the action $a$ is when $\theta$
is the value of the parameter: the smaller the loss, the better the action $a$ relative
to $\theta$.

In estimation, a commonly used $L(\theta, a)$ is the squared error loss function $(a - \theta)^2$.
In testing $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$, a commonly used loss is the 0-1
loss, namely,

$$L(\theta, a) = \begin{cases} 0 & \text{if } \theta = \theta_0 \text{ and } a = a_0 \text{, or } \theta \ne \theta_0 \text{ and } a = a_1; \\ 1 & \text{otherwise.} \end{cases}$$

In interval estimation there is no commonly used loss function. One choice
would be a suitable penalty for length and failure to cover $\theta$ by a chosen
interval $[a, b]$, e.g.,

$$L(\theta, [a, b]) = c_1 L_1(\theta, [a, b]) + c_2 (b - a),$$

where

$$L_1(\theta, [a, b]) = \begin{cases} 1 & \text{if } \theta \notin [a, b]; \\ 0 & \text{otherwise.} \end{cases}$$

To evaluate a decision function $\delta(x)$, one calculates the average loss

$$E_\theta(L(\theta, \delta(X))) = R(\theta, \delta),$$

called the risk of $\delta$. This is a function of $\theta$.

How would one define an optimal decision function? In estimation under
squared error loss, if one confines attention to unbiased decision functions,
then $R(\theta, \delta) = \mathrm{Var}(\delta(X)|\theta)$.
Sometimes one can get a single $\delta_0$ minimizing $R(\theta, \delta)$ for all $\theta$ among all
unbiased $\delta$. Without the restriction to unbiasedness, this is no longer the
case. Similar questions arise in testing and other problems also. Clearly, new
principles are called for. We can introduce a weight function $\pi(\theta)$ and minimize
the weighted risk

$$R(\pi, \delta) = \int R(\theta, \delta)\, \pi(\theta)\, d\theta.$$

A $\delta_0$ minimizing this is called a Bayes rule. This is a problem that we discuss
in Chapter 2. There $\pi(\theta)$ is interpreted as a quantification of prior belief and
is called a prior distribution of $\theta$. We say $\delta_0$ is a Bayes rule in the limit (or
Bayes in the wide sense (Wald (1950))), if for a sequence of priors $\pi_i$,

$$\lim_{i \to \infty} \left[R(\pi_i, \delta_0) - \inf_{\delta} R(\pi_i, \delta)\right] = 0.$$

A somewhat conservative optimization principle is to minimize

$$\sup_{\theta} R(\theta, \delta).$$

A decision rule $\delta_0$ is said to be minimax if

$$\sup_{\theta} R(\theta, \delta_0) = \inf_{\delta} \sup_{\theta} R(\theta, \delta).$$

A sufficient condition for a rule $\delta_0$ to be minimax is that $\delta_0$ minimizes $R(\pi, \delta)$
for some $\pi$ and has constant risk $R(\theta, \delta_0) = c$. Then, for any $\delta$,

$$\sup_{\theta} R(\theta, \delta_0) = c = R(\pi, \delta_0) \le R(\pi, \delta) \le \sup_{\theta} R(\theta, \delta).$$

This argument is due to Wald (1950). In Problem 16, you are asked to prove
that if a rule $\delta_0$ is Bayes in the limit and has constant risk, then it is minimax.
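The sufficient condition for minimaxity can be checked numerically in a standard textbook example that is not discussed in this section: estimating a binomial proportion $p$ under squared error loss. The Python sketch below is an added illustration. It assumes $X \sim B(n, p)$ and compares the exact risk of the MLE $X/n$ with that of $(X + \sqrt{n}/2)/(n + \sqrt{n})$, which is the Bayes rule (posterior mean) for a Beta$(\sqrt{n}/2, \sqrt{n}/2)$ prior. The function names, the grid, and $n = 16$ are assumptions.

```python
from math import comb, sqrt

def risk(estimator, n, p):
    """Exact squared-error risk R(p, delta) = E_p (delta(X) - p)^2 for X ~ B(n, p)."""
    return sum(
        (estimator(x, n) - p) ** 2 * comb(n, x) * p ** x * (1 - p) ** (n - x)
        for x in range(n + 1)
    )

def mle(x, n):
    return x / n

def constant_risk_rule(x, n):
    # Bayes rule for a Beta(sqrt(n)/2, sqrt(n)/2) prior under squared error loss.
    return (x + sqrt(n) / 2) / (n + sqrt(n))

n = 16
grid = [i / 20 for i in range(21)]
risks_mle = [risk(mle, n, p) for p in grid]
risks_const = [risk(constant_risk_rule, n, p) for p in grid]
print(max(risks_mle), max(risks_const))
```

The second rule has risk $n/(4(n + \sqrt{n})^2)$ for every $p$, which is 0.01 when $n = 16$, so by the argument above it is minimax; the MLE has smaller risk near $p = 0$ or $1$ but a larger maximum risk, $1/(4n)$.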


1.6 The Changing Face of Classical Inference 


Because the exact theories of optimal estimates are difficult to apply, atten- 
tion has shifted to approximate algorithmic methods, like the EM algorithm, 
simulation, and asymptotics. Along with this, there has been much interest 
in robust methods that do well under a broad spectrum of models. As an 
example, we discuss the method of Bootstrap due to Efron (see Efron (1982)). 

We illustrate the method of Bootstrap by showing how to calculate, say,
the variance of $\tau(\hat{\theta})$ for a given $\tau$. The original sample is $(x_1, x_2, \ldots, x_n)$. We
sample from this data set $n$ times at random, replacing each chosen item
before the next draw. This produces a pseudo data set that we denote by
$(x_1^*, x_2^*, \ldots, x_n^*)$. We calculate $\hat{\theta}^*$ from this pseudo data and then $\tau(\hat{\theta}^*)$. We
repeat this $N$ times (where $N$ is much larger than $n$) to generate $N$ pseudo
data sets and $N$ pseudo values of $\tau(\hat{\theta})$, which we denote as $\tau_1^*, \tau_2^*, \ldots, \tau_N^*$. The
estimate for $E(\tau(\hat{\theta})|\theta)$ is $\bar{\tau}^* = \frac{1}{N} \sum_{i=1}^{N} \tau_i^*$ and an estimate for $\mathrm{Var}(\tau(\hat{\theta})|\theta)$ is
$\frac{1}{N} \sum_{i=1}^{N} (\tau_i^* - \bar{\tau}^*)^2$. There is considerable numerical and theoretical evidence that
shows the Bootstrap estimates are superior to earlier methods like the delta
method discussed in Section 1.4.
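The resampling steps just described can be sketched as follows. This Python code is an added illustration rather than the authors' implementation. It assumes that $\hat{\theta}$ is the sample mean and that $\tau(\theta) = \theta^2$, and it uses simulated $N(2, 1)$ data in place of a real sample; the function name, the seed, and $N = 2000$ resamples are hypothetical choices. A delta-method variance, with $\sigma^2$ estimated from the sample, is printed alongside for comparison.

```python
import random
import statistics

def bootstrap_moments(data, tau, n_boot=2000, seed=5):
    """Bootstrap estimates of the mean and variance of tau(theta_hat), theta_hat = sample mean."""
    rng = random.Random(seed)
    n = len(data)
    pseudo_values = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in range(n)]   # n draws with replacement
        theta_star = statistics.fmean(resample)            # theta_hat* from pseudo data
        pseudo_values.append(tau(theta_star))
    tau_bar = statistics.fmean(pseudo_values)
    variance = sum((t - tau_bar) ** 2 for t in pseudo_values) / n_boot
    return tau_bar, variance

rng = random.Random(0)
sample = [rng.gauss(2.0, 1.0) for _ in range(50)]           # simulated N(2, 1) data
boot_mean, boot_var = bootstrap_moments(sample, tau=lambda t: t ** 2)
delta_var = (2 * statistics.fmean(sample)) ** 2 * statistics.pvariance(sample) / len(sample)
print(boot_mean, boot_var, delta_var)
```

For a smooth $\tau$ and a moderately large sample, as here, the two variance estimates should be of similar size; differences are more likely to show up for small $n$ or strongly nonlinear $\tau$.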

Finally, classical statistics has come up with many new methods for deal- 
ing with high-dimensional problems. A couple of them will be discussed in 
Chapter 9. 


1.7 Exercises 


1. Verify that $N(\mu, \sigma^2)$, exponential with $f(x|\theta) = \frac{1}{\theta} e^{-x/\theta}$, Bernoulli($p$),
binomial $B(n, p)$, and Poisson $P(\lambda)$, each constitutes an exponential family.

2. Verify (1.4). 

3. Assuming p = 1 in (1.4), show that a => 0, 

4. (a) Generate data by drawing a sample of size $n = 30$ from $N(\mu, 1)$ with
$\mu = 2$. For your data, plot the likelihood function and comment on its
shape and how informative it is about $\mu$.

(b) For an exponential family, show that the likelihood function is log
concave, i.e., the matrix with $(i, j)$th element $\frac{\partial^2 \log L}{\partial\theta_i \partial\theta_j}$ is negative definite.
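(Hint. The proof is similar to that for Problem 3. By direct calculation

$$\frac{\partial^2 \log L}{\partial\theta_i \partial\theta_j} = E\left(\frac{\partial^2 \log L}{\partial\theta_i \partial\theta_j}\right) = -E\left(\frac{\partial \log L}{\partial\theta_i}\, \frac{\partial \log L}{\partial\theta_j}\right).$$

Now use the fact that a variance-covariance matrix is positive definite,
unless the distribution is degenerate.)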
(c) Let X1,..., Xn be iid. with density f(z|@), p = 1, in an exponential 
family. Show that MLE of 7 is (1/n) 57, t(X;) and hence the MLE 6 + 6 
as n —> oo. 

5. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$, with $\mu$, $\sigma^2$ unknown. Let $\tau(\mu, \sigma^2) =
P\{X_1 \le 0 \mid \mu, \sigma^2\}$.
(a) Calculate $\tau(\hat{\mu}, \hat{\sigma}^2)$, where $\hat{\mu}$ and $\hat{\sigma}^2$ are the MLE of $\mu$ and $\sigma^2$.
(b) Show that the best unbiased estimate of $\tau(\mu, \sigma^2)$ is
$$W(X) = E(I\{X_1 \le 0\} \mid \bar{X}, S^2) = F(-\bar{X}/S),$$
where $S^2$ is the sample variance and $F$ is the distribution function of
$(X_1 - \bar{X})/S$.
(c) For $\mu = 0$, $\sigma^2 = 1$, $n = 36$ find the mean squared errors
$E\{(\tau(\hat{\mu}, \hat{\sigma}^2) - \tau(0, 1))^2 \mid 0, 1\}$ and $E\{(W(X) - \tau(0, 1))^2 \mid 0, 1\}$ approximately
by simulations.
(d) Estimate the mean, variance and the mean squared error of $\tau(\hat{\mu}, \hat{\sigma}^2)$
by (i) delta method, (ii) Bootstrap, and compare with (c).

6. Let $X_1, X_2, \ldots, X_n$ be i.i.d. with density $(1/\sigma) f((x - \mu)/\sigma)$. Show that
for fixed $\sigma$, $P_{\mu, \sigma}\{\sum_{i=1}^{n} X_i > c\}$ is an increasing function of $\mu$.

7. $X_1, X_2, \ldots, X_n$ are said to have a family of densities $f(x|\theta)$ with monotone
likelihood ratio (MLR) in $T(x)$ if there exists a sufficient statistic $T(x)$
such that $f(x|\theta_2)/f(x|\theta_1)$ is a non-decreasing function of $T(x)$ if $\theta_2 > \theta_1$.
(a) Verify that exponential families have this property.
(b) If $f(x|\theta)$ has MLR in $T$, show that $P_\theta\{T(X) > c\}$ is non-decreasing
in $\theta$.

8. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, 1)$. Let $0 < \alpha, \beta < 1$, $\Delta > 0$.
(a) For $H_0 : \mu = \mu_0$ versus $H_1 : \mu > \mu_0$, show the smallest sample size $n$
for which the UMP test has probability of error of first kind equal to $\alpha$ and
probability of error of second kind $\le \beta$ for $\mu \ge \mu_0 + \Delta$ is (approximately)
the integer part of $((z_\alpha + z_\beta)/\Delta)^2 + 1$.
Evaluate $n$ numerically when $\Delta = .5$, $\alpha = .01$, $\beta = .05$.
(b) For $H_0 : \mu = \mu_0$ versus $H_1 : \mu \ne \mu_0$, show the smallest $n$ such that the
UMPU test has probability of error of first kind equal to $\alpha$ and probability
of error of second kind $\le \beta$ for $|\mu - \mu_0| \ge \Delta$ is (approximately) the solution
of
$$\Phi(z_{\alpha/2} - \sqrt{n}\Delta) + \Phi(z_{\alpha/2} + \sqrt{n}\Delta) = 1 + \beta.$$
Evaluate $n$ numerically when $\Delta = 0.5$, $\alpha = .01$ and $\beta = .05$.

9. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $U(0, \theta)$, $\theta > 0$. Find the smallest $n$ such that
the UMP test of $H_0 : \theta = \theta_0$ against $H_1 : \theta > \theta_0$ has probability of error
of first kind equal to $\alpha$ and probability of error of second kind $\le \beta$ for
$\theta \ge \theta_1$, with $\theta_1 > \theta_0$.

10. (Basu (1988, p.1)) Let $X_1, X_2, \ldots, X_n$ be i.i.d. $U(\theta, 2\theta)$, $\theta > 0$.
(a) What is the likelihood function of $\theta$?
(b) What is the minimal sufficient statistic in this problem?
(c) Find $\hat{\theta}$, the MLE of $\theta$.
(d) Let $X_{(1)} = \min(X_1, \ldots, X_n)$ and $T = (4\hat{\theta} + X_{(1)})/5$. Show that
$E((T - \theta)^2)/E((\hat{\theta} - \theta)^2)$ is always less than 1, and further,
$$\frac{E((T - \theta)^2)}{E((\hat{\theta} - \theta)^2)} \to \frac{12}{25} \quad \text{as } n \to \infty.$$

11. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, 1)$. A statistician has to test $H_0 :
\mu = 0$; he selects his alternative depending on data. If $\bar{X} < 0$, he tests
against $H_1 : \mu < 0$. If $\bar{X} > 0$, his alternative is $H_1 : \mu > 0$.
(a) If the statistician has taken $\alpha = .05$, what is his real $\alpha$?

(b) Calculate his power at $\mu = \pm 1$ when his nominal $\alpha = .05$, $n = 25$. Will
this power be smaller than the power of the UMPU test with $\alpha = .05$?

12. Consider $n$ patients who have received a new drug that has reduced
their blood pressure by amounts $X_1, X_2, \ldots, X_n$. It may be assumed that
$X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ where $\sigma^2$ is assumed known for simplicity.
On the other hand, for a standard drug in the market it is known that
the average reduction in blood pressure is $\mu_0$. The company producing the
new drug claims $\mu = \mu_0$, i.e., it does what the old drug does (and probably
costing much less). Discuss what should be $H_0$ and $H_1$ here. (This is a
problem of bio-equivalence.)

13. (P-values) The error probabilities of a test do not provide a measure of
the strength of evidence against $H_0$ in a particular data set. The P-values
defined below try to capture that.
Suppose $H_0 : \theta = \theta_0$ and your test is to reject $H_0$ for large values of a
test statistic $W(X)$, say, you reject $H_0$ if $W > W_\alpha$. Then, when $X = x$
is observed, the P-value is defined as
$$P(x) = 1 - F^W_{\theta_0}(W(x)),$$
where $F^W_{\theta_0}$ = distribution function of $W$ under $\theta_0$.
(a) Show that if $F^W_{\theta_0}$ is continuous then $P(X)$ has uniform distribution
on $(0, 1)$.
(b) Suppose you are not given the value of $W$ but you know $P$. How will
you decide whether to accept or reject $H_0$?
(c) Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, 1)$. You are testing $H_0 : \mu = \mu_0$ versus
$H_1 : \mu \ne \mu_0$. Define P-value for the UMPU test. Calculate $E_{\mu_0}(P)$ and
$E_{\mu_0}(P \mid P \le \alpha)$.

14. (a) Let $f(x|\theta_0)$, $f(x|\theta_1)$ and $I_0$ be as in Part A of the Neyman-Pearson
Lemma. The constant $k$ is chosen not from given $\alpha$ but such that
$$E_{\theta_0}(I_0) = 1 - E_{\theta_1}(I_0).$$
Then show that $I_0$ is minimax, i.e., $I_0$ minimizes the maximum error
probability,
$$\max(E_{\theta_0}(I_0), 1 - E_{\theta_1}(I_0)) \le \max(E_{\theta_0}(I), 1 - E_{\theta_1}(I)).$$
(b) Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, 1)$. Using (a) find the minimax test
of $H_0 : \mu = -1$ versus $H_1 : \mu = +1$.

15. (a) Let $X$ have density $f(x|\theta)$ and $\Theta = \{\theta_0, \theta_1\}$. The null hypothesis is
$H_0 : \theta = \theta_0$, the alternative is $H_1 : \theta = \theta_1$. Suppose the error probabilities
of each randomized test $\phi$ are denoted by $(\alpha_\phi, \beta_\phi)$ and $S$ = the collection of
all points $(\alpha_\phi, \beta_\phi)$. $S$ is called the risk set. Show that $S$ is convex.
(b) Let $X$ be $B(2, p)$, $p = \frac{1}{2}$ (corresponding with $H_0$) or $\frac{3}{4}$ (corresponding
with $H_1$). Plot the risk set $S$ as a subset of the unit square.
(Hint. Identify the lower boundary of $S$ as a polygon with vertices corresponding
with non-randomized most powerful tests. The upper boundary
connects vertices corresponding with least powerful tests that are similar
to $I_0$ in the N-P lemma but with reverse inequalities.)


16. (a) Suppose $\delta_0$ is a decision rule that has constant risk and is Bayes in
the limit (as defined in Section 1.5). Show that $\delta_0$ is minimax.
(b) Consider i.i.d. observations $X_1, \ldots, X_n$ from $N(\mu, 1)$. Using a normal
prior distribution for $\mu$, show that $\bar{X}$ is a minimax estimate for $\mu$ under
squared error loss.

17. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$. Consider estimating $\mu$.
(a) Show that both $\bar{X}$ and the sample median $M$ are unbiased estimators
of $\mu$.
(b) Further, show that both of them are consistent and asymptotically
normal.
(c) Discuss why you would prefer one over the other.

18. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$, $Y_1, Y_2, \ldots, Y_m$ be i.i.d. $N(\eta, \tau^2)$ and
let these two samples be independent also. Find the set of minimal sufficient
statistics when
(a) $-\infty < \mu, \eta < \infty$, $\sigma^2 > 0$ and $\tau^2 > 0$.
(b) $\mu = \eta$, $-\infty < \mu < \infty$, $\sigma^2 > 0$ and $\tau^2 > 0$.
(c) $-\infty < \mu, \eta < \infty$, $\sigma^2 = \tau^2$, and $\sigma^2 > 0$.
(d) $\mu = \eta$, $\sigma^2 = \tau^2$, $-\infty < \mu < \infty$, and $\sigma^2 > 0$.

19. Suppose $X_i$, $i = 1, 2, \ldots, n$ are i.i.d. from the exponential family with
density (1.2) having full rank, i.e., the parameter space contains a
$p$-dimensional open rectangle. Then show that $(T_j = \sum_{i=1}^{n} t_j(X_i),\ j =
1, \ldots, p)$ together form a minimal sufficient statistic.

20. Refer to the 'factorization theorem' in Section 1.3. Show that a statistic
$U$ is sufficient if and only if for every pair $\theta_1$, $\theta_2$, the ratio $f(x|\theta_2)/f(x|\theta_1)$
is a function of $U(x)$.


Bayesian Inference and Decision Theory


This chapter is an introduction to basic concepts and implementation of 
Bayesian analysis. We begin with subjective probability as distinct from clas- 
sical or objective probability of an uncertain event based on the long run 
relative frequency of its occurrence. Subjective probability, along with utility 
or loss function, leads to Bayesian inference and decision theory, e.g., estima- 
tion, testing, prediction, etc. 

Elicitation of subjective probability is relatively easy when the observa- 
tions are exchangeable. We discuss exchangeability, its role in Bayesian anal- 
ysis, and its importance for science as a whole. 

In most cases in practice, quantification of subjective belief or judgment 
is not easily available. It is then common to choose from among conventional 
priors on the basis of some relatively simple subjective judgments about the 
problem and the conventional probability model for the data. Such priors are 
called objective or noninformative. These priors have been criticized for vari- 
ous reasons. For example, they depend on the form of the likelihood function 
and usually are improper, i.e., the total probability of the parameter space is 
infinity. Here in Chapter 2, we discuss how they are applied; some answers to 
the criticisms are given in Chapter 5. 

In Section 2.3 of this chapter, there is a brief discussion of the many 
advantages of being a Bayesian. 


2.1 Subjective and Frequentist Probability 


Probability has various connotations. Historically, it has been connected with 
both personal evaluation of uncertainty, as in gambling or other decision mak- 
ing under uncertainty, and predictions about proportion of occurrence of some 
uncertain event. Thus when a person says the probability is half that this par- 
ticular coin will turn up a head, then it will usually mean that in many tosses 
about half the time it will be a head (a version of the law of large numbers). 
But it can also mean that if someone puts this bet on head (if head he wins
a dollar, if not he loses a dollar), the gamble is fair. The first interpretation is
frequentist, the second subjective. Similarly one can have both interpretations 
in mind when a weather forecast says there is a probability of 60% of rain, but 
the subjective interpretation matters more. It helps you decide if you will take 
an umbrella. Finally, one can think up situations, e.g., election of a particular 
candidate or success of a particular student in a particular test, where only 
the subjective interpretation is valid. 

Some scientists and philosophers, notably Jeffreys and Carnap, have ar- 
gued that there may be a third kind of probability that applies to scientific 
hypotheses. It may be called objective or conventional or non-subjective in 
the sense that it represents a shared belief or shared convention rather than 
an expression of one person’s subjective uncertainty. 

Fortunately, the probability calculus remains the same, no matter which 
kind of probability one uses. A Bayesian takes the view that all unknown 
quantities, namely the unknown parameter and the data before observation, 
have a probability distribution. For the data, the distribution, given $\theta$, comes
from a model that arises from past experience in handling similar data as well
as subjective judgment. The distribution of $\theta$ arises as a quantification of the
Bayesian’s knowledge and belief. If her knowledge and belief are weak, she 
may fall back on a common objective distribution in such situations. 

Excellent expositions of subjective and objective Bayes approaches are 
Savage (1954, 1972), Jeffreys (1961), DeGroot (1970), Box and Tiao (1973), 
and Berger (1985a). Important relatively recent additions to the literature are 
Bernardo and Smith (1994), O’Hagan (1994), Gelman et al. (1995), Carlin and 
Louis (1996), Leonard and Hsu (1999), Robert (2001), and Congdon (2001). 


2.2 Bayesian Inference 


Informally, to make inference about $\theta$ is to learn about the unknown $\theta$ from
data $X$, i.e., based on the data, explore which values of $\theta$ are probable, what
might be plausible numbers as estimates of different components of $\theta$, and the
extent of uncertainty associated with such estimates. In addition to having
a model $f(x|\theta)$ and a likelihood function, the Bayesian needs a distribution
for $\theta$. The distribution is called a prior distribution or simply a prior
because it quantifies her uncertainty about $\theta$ prior to seeing data. The prior
may represent a blending of her subjective belief and knowledge, in which
case it would be a subjective prior. Alternatively, it could be a conventional
prior supposed to represent small or no information. Such a prior is called an
objective prior. We discuss construction of objective priors in Chapter 5 (and
in Section 6.7.3 to some extent). An example of elicitation of a subjective prior
is given in Section 5.4.

Given all the above ingredients, the Bayesian calculates the conditional
probability density of $\theta$ given $X = x$ by Bayes formula

$$\pi(\theta|x) = \frac{\pi(\theta) f(x|\theta)}{\int_{\Theta} \pi(\theta') f(x|\theta')\, d\theta'} \qquad (2.1)$$

where $\pi(\theta)$ is the prior density function and $f(x|\theta)$ is the density of $X$,
interpreted as the conditional density of $X$ given $\theta$. The numerator is the
joint density of $\theta$ and $X$ and the denominator is the marginal density of $X$.
The symbol $\theta$ now represents both a random variable and its value. When the
parameter $\theta$ is discrete, the integral in the denominator of (2.1) is replaced
by a sum.

The conditional density $\pi(\theta|x)$ of $\theta$ given $X = x$ is called the posterior
density, a quantification of our uncertainty about $\theta$ in the light of data. The
transition from $\pi(\theta)$ to $\pi(\theta|x)$ is what we have learnt from the data.

A Bayesian can simply report her posterior distribution, or she could report
summary descriptive measures associated with her posterior distribution. For
example, for a real valued parameter $\theta$, she could report the posterior mean

$$E(\theta|x) = \int_{-\infty}^{\infty} \theta\, \pi(\theta|x)\, d\theta$$

and the posterior variance

$$\mathrm{Var}(\theta|x) = E\{(\theta - E(\theta|x))^2 \,|\, x\} = \int_{-\infty}^{\infty} (\theta - E(\theta|x))^2\, \pi(\theta|x)\, d\theta$$

or the posterior standard deviation. Finally, she could use the posterior
distribution to answer more structured problems like estimation and testing. In the
case of estimation of $\theta$, one would report the above summary measures. In the
case of testing one would report the posterior odds of the relevant hypotheses.


Example 2.1. We illustrate these ideas with an example of inference about $\mu$
for normally distributed data ($N(\mu, \sigma^2)$) with mean $\mu$ and variance $\sigma^2$. The
data consist of i.i.d. observations $X_1, X_2, \ldots, X_n$ from this distribution. To
keep the example simple we assume $n = 10$ and $\sigma^2$ is known. A mathematically
convenient and reasonably flexible prior distribution for $\mu$ is a normal
distribution with suitable prior mean and variance, which we denote by $\eta$ and
$\tau^2$. To fix ideas we take $\eta = 100$. The prior variance $\tau^2$ is a measure of the
strength of our belief in the prior mean $\eta = 100$ in the sense that the larger
the value of $\tau^2$, the less sure we are about our prior guess about $\mu$. Jeffreys
(1961) has suggested we can calibrate $\tau^2$ by comparing with $\sigma^2$. For example,
setting $\tau^2 = \sigma^2/m$ would amount to saying information about $\mu$ is about as
strong as the information in $m$ observations in data. Some support for this
interpretation is provided in Chapter 5. By way of illustration, we take $m = 1$.
With a little algebra (vide Problem 2), the posterior distribution can be shown
to be normal with posterior mean

$$E(\mu|X) = \frac{\frac{1}{\tau^2}\eta + \frac{n}{\sigma^2}\bar{X}}{\frac{1}{\tau^2} + \frac{n}{\sigma^2}} = \frac{1}{11}\eta + \frac{10}{11}\bar{X} \qquad (2.2)$$

and posterior variance

$$\left(\frac{1}{\tau^2} + \frac{n}{\sigma^2}\right)^{-1} = \frac{\sigma^2}{11}, \qquad (2.3)$$

i.e., in the light of the data, $\mu$ shifts from prior guess $\eta$ towards a weighted
average of the prior guess about $\mu$ and $\bar{X}$, while the variability reduces from
$\sigma^2$ to $\sigma^2/11$. If the prior information is small, implying large $\tau^2$, or there are
lots of data, i.e., $n$ is large, the posterior mean is close to the MLE $\bar{X}$.
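
The formulas (2.2) and (2.3) are easy to check numerically. The following is a minimal Python sketch, not part of the original example: the ten data values and the choice $\sigma^2 = 4$ are hypothetical, and the function name `normal_posterior` is ours.

```python
from statistics import fmean


def normal_posterior(xs, sigma2, eta, tau2):
    """Posterior mean and variance of mu for i.i.d. N(mu, sigma2) data
    (sigma2 known) under a N(eta, tau2) prior on mu."""
    n = len(xs)
    xbar = fmean(xs)
    prior_precision = 1.0 / tau2      # weight attached to the prior mean eta
    data_precision = n / sigma2       # weight attached to the sample mean
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * eta + data_precision * xbar)
    return post_mean, post_var


# Hypothetical data with n = 10; prior N(100, sigma2), i.e., m = 1.
xs = [104.0, 98.5, 101.2, 103.3, 99.8, 102.1, 100.4, 105.0, 97.9, 102.8]
sigma2 = 4.0
post_mean, post_var = normal_posterior(xs, sigma2, eta=100.0, tau2=sigma2)
print(post_mean, post_var)
```

With $m = 1$ and $n = 10$ the returned variance equals $\sigma^2/11$ and the mean equals $\eta/11 + 10\bar{X}/11$, whatever data are supplied.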


We will see later that we can quantify how much we have learnt from the
data by comparing $\pi(\mu)$ and $\pi(\mu|X)$. The posterior depends on both the prior
and the data. As data increase the influence of data tends to wash away the 
prior. Our second example goes back in principle to Bayes, Laplace, and Karl 
Pearson (The Grammar of Science, 1892). 


Example 2.2. Consider an urn with $Np$ red and $N(1 - p)$ black balls; $p$ is
unknown but $N$ is a known large number. Balls are drawn at random one
by one and with replacement; selection is stopped after $n$ draws. For
$i = 1, 2, \ldots, n$, let

$$X_i = \begin{cases} 1 & \text{if the $i$th ball drawn is red;} \\ 0 & \text{otherwise.} \end{cases}$$

Then the $X_i$'s are i.i.d. $B(1, p)$, i.e., Bernoulli with probability of success $p$. Let $p$
have a prior distribution $\pi(p)$. We will consider a family of priors for $p$ that
simplifies the calculation of posterior and then consider some commonly used
priors from this family. Let

$$\pi(p) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha - 1}(1 - p)^{\beta - 1}, \quad 0 \le p \le 1;\ \alpha > 0,\ \beta > 0. \qquad (2.4)$$

This is called a Beta distribution. (Note that for convenience we take $p$ to
assume all values between 0 and 1, rather than only $0, 1/N, 2/N$, etc.) The prior
mean and variance are $\alpha/(\alpha + \beta)$ and $\alpha\beta/\{(\alpha + \beta)^2(\alpha + \beta + 1)\}$, respectively.
By Bayes formula, the posterior density can be written as

$$\pi(p|X = x) = C(x)\, p^{\alpha + r - 1}(1 - p)^{\beta + (n - r) - 1} \qquad (2.5)$$

where $r = \sum_{i=1}^{n} x_i$ = number of red balls, and $(C(x))^{-1}$ is the denominator
in the Bayes formula. A comparison with (2.4) shows the posterior is also a
Beta density with $\alpha + r$ in place of $\alpha$ and $\beta + (n - r)$ for $\beta$, and

$$C(x) = \Gamma(\alpha + \beta + n)/\{\Gamma(\alpha + r)\Gamma(\beta + n - r)\}.$$

The posterior mean and variance are

$$E(p|x) = (\alpha + r)/(\alpha + \beta + n),$$

$$\mathrm{Var}(p|x) = \frac{(\alpha + r)(\beta + n - r)}{(\alpha + \beta + n)^2(\alpha + \beta + n + 1)}. \qquad (2.6)$$
As indicated earlier, a Bayesian analyst may just report the posterior (2.5),
and the posterior mean and variance, which provide an idea of the center
and dispersion of the posterior distribution. It will not escape one's attention
that if $n$ is large then the posterior mean is approximately equal to the MLE,
$\hat{p} = r/n$, and the posterior variance is quite small, so the posterior is
concentrated around $\hat{p}$ for large $n$. We can interpret this as an illustration of a fact
mentioned before: when we have lots of data, the data tend to wash away the
influence of the prior.

The posterior mean can be rewritten as a weighted average of the prior
mean and MLE:

$$E(p|x) = \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \frac{\alpha}{\alpha + \beta} + \frac{n}{\alpha + \beta + n} \cdot \frac{r}{n}.$$

Once again, the importance of both the prior and the data comes out, the
relative importance of the prior and the data being measured by $(\alpha + \beta)$ and
$n$.

Suppose we want to predict the probability of getting a red ball in a new
$(n + 1)$-st draw given the above data. This has been called a fundamental
problem of science. It would be natural to use $E(p|x)$, the same estimate as
above. We list below a number of commonly used priors and the corresponding
value of $E(p|X_1, X_2, \ldots, X_n)$.

The uniform prior corresponds with $\alpha = \beta = 1$, with posterior mean equal
to $(\sum_{i=1}^{n} X_i + 1)/(n + 2)$. This was a favorite of Laplace and Bayes but not so
popular anymore. If $\alpha = \beta = \frac{1}{2}$, we have the Jeffreys prior with posterior mean
$(\sum_{i=1}^{n} X_i + \frac{1}{2})/(n + 1)$. This prior is very popular in the case of one-dimensional
$\theta$ as here. It is also a reference prior due to Bernardo (1979). Reference priors
are very popular. If we take a Beta density with $\alpha = 0$, $\beta = 0$, it integrates
to infinity. Such a prior is called improper. If we still use the Bayes formula
to produce a posterior density, the posterior is proper unless $r = 0$ or $n$. The
posterior mean is exactly equal to the MLE.
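
The three priors just listed can be compared side by side with a few lines of Python. This is only a sketch under assumed data ($r = 3$ red balls in $n = 10$ draws, chosen by us for illustration); the function names are ours and simply encode the posterior mean and variance formulas of (2.6).

```python
def beta_posterior_mean(r, n, alpha, beta):
    """Posterior mean (alpha + r) / (alpha + beta + n) for a Beta(alpha, beta) prior."""
    return (alpha + r) / (alpha + beta + n)


def beta_posterior_var(r, n, alpha, beta):
    """Posterior variance from (2.6)."""
    a, b = alpha + r, beta + n - r
    return a * b / ((a + b) ** 2 * (a + b + 1))


# Hypothetical data: r = 3 red balls in n = 10 draws.
r, n = 3, 10
priors = {
    "uniform (alpha = beta = 1)": (1.0, 1.0),
    "Jeffreys (alpha = beta = 1/2)": (0.5, 0.5),
    "improper (alpha = beta = 0)": (0.0, 0.0),
}
for name, (alpha, beta) in priors.items():
    mean = beta_posterior_mean(r, n, alpha, beta)
    var = beta_posterior_var(r, n, alpha, beta)
    print(f"{name}: mean = {mean:.4f}, variance = {var:.5f}")
print("MLE:", r / n)
```

For the improper choice the posterior mean coincides with $r/n$, while the uniform and Jeffreys priors pull the estimate slightly towards $1/2$.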

Objective priors are usually improper. To be usable they must have proper 
posteriors. It is argued in Chapter 5 that improper priors are best understood 
through the posteriors they produce. One might examine whether the poste- 
rior seems reasonable. 

Suppose we think of the problem as a representation of production of
defective and non-defective items in a factory producing switches; we would
take red to mean defective and black to mean a good switch. In this context,
there would be some prior information available from the engineers. They
may be able to pinpoint the likely value of $p$, which may be set equal to the
prior mean $\alpha/(\alpha + \beta)$. If one has some knowledge of prior variability also, one
would have two equations from which to determine $\alpha$ and $\beta$. In this particular
context, the Jeffreys prior with a lot of mass at the two end points might be
adequate if the process maintains a high level of quality (small $p$) except when
it is out of control and has high values of $p$. The peak of the prior near $p = 1$
could reflect frequent occurrence of lack of control or a pessimistic prior belief
to cope with disasters.

Table 2.1. An Epidemiological Study

                                    Food Eaten
                    Crabmeat                         No Crabmeat
           Potato Salad  No Potato Salad    Potato Salad  No Potato Salad
Ill             …              …                 22              …
Not Ill         …              …                 24              …

It is worth noting that the uniform, Jeffreys prior, and reference priors are
examples of objective priors and that all of them produce a posterior mean
that is very close to the MLE even for small $n$. Also all of them make better
sense than the MLE in the extreme case when $\hat{p} = 0$. In most contexts the
estimate $\hat{p} = 0$ is absurd; the objective Bayes estimates move it a little
towards $p = \frac{1}{2}$, which corresponds with total ignorance in some sense. Such a
movement is called a shrinkage. Agresti and Caffo (2000) and Brown et al.
(2003) have shown that such estimates lead to confidence intervals with closer
agreement between nominal and true coverage probability than the usual
confidence intervals based on normal approximation to $\hat{p}$ or inversion of tests. In
other words, the Bayesian approach seems to lead to a more reasonable point
estimate as well as a more reliable confidence interval than the common
classical answers based on MLE.











Example 2.3. This example illustrates the advantages of a Bayesian interpre- 
tation of probability of making a wrong inference for given data as opposed 
to classical error probabilities over repetitions. In this epidemiological study 
repetitions don’t make sense. 

The data in Table 2.1 on food poisoning at an outing are taken from 
Bishop et al. (1975) who provide the original source of the study. Altogether 
320 people attended the outing, 304 responded to questionnaires. 

There was other food also but only two items, potato salad and crabmeat, 
attracted suspicion. We focus on the main suspect, namely, potato salad. A 
partial Bayesian analysis of this example will be presented later in Chapter 4. 


Example 2.4. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ and assume for simplicity
$\sigma^2$ is known. As in Chapter 1, $\mu$ may be the expected reduction of blood
pressure due to a new drug. You want to test $H_0 : \mu \le \mu_0$ versus $H_1 : \mu > \mu_0$,
where $\mu_0$ corresponds with a standard drug already in the market.

Let $\pi(\mu)$ be the prior. First calculate the posterior density $\pi(\mu|X)$. Then
calculate

$$\int_{-\infty}^{\mu_0} \pi(\mu|X)\, d\mu = P\{H_0|X\}$$

and

$$\int_{\mu_0}^{\infty} \pi(\mu|X)\, d\mu = 1 - P\{H_0|X\} = P\{H_1|X\}.$$

One may simply report these numbers or choose one of the two hypotheses if
one of the two probabilities is substantially bigger.

We provide some calculations when the prior for $\mu$ is $N(\eta, \tau^2)$. We recall
from Example 2.1 that the posterior for $\mu$ is normal with mean and variance
given by equations (2.2) and (2.3). It follows that

$$\pi(\mu \le \mu_0|X) = \Phi(z) \quad \text{and} \quad \pi(\mu > \mu_0|X) = 1 - \Phi(z)$$

where $\Phi$ is the standard normal distribution function and

$$z = \frac{\mu_0 - \left(\frac{1}{\tau^2}\eta + \frac{n}{\sigma^2}\bar{X}\right) \Big/ \left(\frac{1}{\tau^2} + \frac{n}{\sigma^2}\right)}{\left(\frac{1}{\tau^2} + \frac{n}{\sigma^2}\right)^{-1/2}}.$$

A conventional choice is to make $\tau^2 \to \infty$ above, which would give the same
result as assuming an improper uniform prior

$$\pi(\mu) = c, \quad -\infty < \mu < \infty.$$

Any of these would lead to

$$z = (\mu_0 - \bar{X})\frac{\sqrt{n}}{\sigma}.$$

Suppose we wish to reject if the posterior odds against $H_0$ are 19:1 or more,
i.e., if posterior probability of $H_0$ is $\le .05$. Then, writing $z_{.05}$ for the upper 5%
point of the standard normal distribution, we reject $H_0$ if

$$\mu_0 - \bar{X} \le -z_{.05}\frac{\sigma}{\sqrt{n}}, \quad \text{or} \quad \bar{X} \ge \mu_0 + z_{.05}\frac{\sigma}{\sqrt{n}},$$

which is exactly the same as the classical test for this problem with $\alpha = .05$.
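
The agreement between the two rules under the flat prior can be verified numerically. The sketch below is ours, not the text's: it assumes the improper uniform prior, so that the posterior of $\mu$ is $N(\bar{X}, \sigma^2/n)$, and the numerical values of $\bar{X}$, $n$, $\sigma$ and $\mu_0$ are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def posterior_prob_h0(xbar, n, sigma, mu0):
    """P(mu <= mu0 | X) when the posterior of mu is N(xbar, sigma^2 / n),
    as it is under the improper uniform prior."""
    z = (mu0 - xbar) * sqrt(n) / sigma
    return NormalDist().cdf(z)


def classical_reject(xbar, n, sigma, mu0, alpha=0.05):
    """One-sided classical z-test: reject H0 when xbar >= mu0 + z_alpha * sigma / sqrt(n)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return xbar >= mu0 + z_alpha * sigma / sqrt(n)


# Hypothetical setting: n = 10, sigma = 1, mu0 = 0, and several sample means.
for xbar in (0.0, 0.4, 0.6, 1.0):
    p0 = posterior_prob_h0(xbar, 10, 1.0, 0.0)
    print(xbar, round(p0, 4), p0 <= 0.05, classical_reject(xbar, 10, 1.0, 0.0))
```

For each sample mean the Bayesian decision (posterior probability of $H_0$ at most .05) and the classical decision coincide, although the interpretations of the two numbers differ.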

However, if we had wished to test the sharp null hypothesis $H_0 : \mu = \mu_0$
against $H_1 : \mu \ne \mu_0$ or $H_1 : \mu > \mu_0$, we have to choose the prior in a different
way since the prior we chose would assign zero probability to $H_0$. Moreover,
the answers tend to be very different from classical answers as we shall see in
Chapter 6.


2.3 Advantages of Being a Bayesian 


The Bayesian approach provides a fairly explicit solution to common problems 
of statistical inference (Chapters 2 and 8), new problems of high-dimensional 
data analysis that are coming up because of emergence of high-dimensional 
data sets (Chapters 9 and 10), as well as complex decision problems of real 
life (Chapter 10). It can handle presence of prior knowledge or partial prior
knowledge, especially constraints like $a \le \theta \le b$, relatively easily. In some cases,
a subjective prior can be elicited (Chapter 5), and in most other cases one 
can choose objective priors. Of course, in all cases one would wish to study to 
some extent the robustness of various aspects of the posterior with respect to 
modest variation in prior as illustrated in Chapter 3. 

In classical decision theory, there are theorems due to Wald that imply 
that Bayes rules and their limits together form a complete class, i.e., any 
decision rule that is not of this form can be improved by a rule of this form. 
In a similar vein and as a sort of converse, Wald (1950) also proved that if a 
decision rule is admissible then it must be Bayes or limit of Bayes rules. There 
are various senses in which a decision rule $\delta$ can be a Bayes rule in the limit.

In this book, we stress objective priors, because it still seems difficult to 
elicit fully subjective priors, at least in most problems in practice. If a fully 
subjective prior is available we would indeed use it. In particular, whatever 
subjective input is available ought to be used, specially in high-dimensional 
problems. 

The Bayesian approach can be deduced from several sets of axioms. One 
such set is discussed in Section 3.3. Moreover, the subjective Bayesian ap- 
proach is free from certain paradoxes or violation of principles that are asso- 
ciated with classical statistics. These unpleasant properties are due to the fact 
that classical statistics provides either data dependent measures like P-values 
which are not easy to interpret or evaluations like risk functions or confidence 
coefficients that are obtained by integrating over the whole sample space and 
so may be absurd when a particular data set is in hand. The paradoxes can 
be quite dramatic. The objective Bayesian approach is not completely free 
from violation of some of these principles. We discuss some of these issues in 
Section 5.2. 

Bayesians usually accept as a principle that some validation in the real 
world is good whenever possible. Occasionally, a proxy for the real world 
may be found in conceptual frequentist constructions of possible real world 
scenarios and a Bayesian may seek some sort of validation in such cases. By 
validation in the real world we mean predictive ability. One may use a baseball 
or cricket or soccer player’s performance in the first half of the season to 
predict his performance in the second half. For a successful application of 
(parametric empirical) Bayes methodology, relative to classical methods, see 
Morris (1983) and Ghosh and Meeden (1997). By cross validation, one means 
that a part of data is used to make an inference and the other part to validate 
it, even if these two parts do not have a connotation of present and future as 
in the baseball example of Morris (1983). A validation of Bayesian approach 
to model selection is given in Hoeting et al. (1999). Most Bayesian papers on 
new methods offer some validation. 

It turns out that in objective Bayesian analysis one often has such frequen- 
tist validation; see, for example, the concept of probability matching priors 
(Subsection 5.1.4). Although this provides some reconciliation between the two 
approaches as far as the decision that is made, only the objective Bayesian
approach has a posterior and hence a data dependent method of evaluating 
the performance of the decision. 

Finally, basic Bayesian ideas and measures are easy to interpret and hence 
easy to communicate. 

One may well ask why in spite of all these advantages, an explosive growth 
and spread of the Bayesian approach has occurred only recently, in the past 
fifteen or so years. A major factor has been the arrival of MCMC (Markov 
chain Monte Carlo) in a big way and consequent advances in computation 
of posteriors for high-dimensional $\theta$ and many real-life applications. A classic
paper that ushered in these changes is Gelfand and Smith (1990). 


2.4 Paradoxes in Classical Statistics 


The evaluation of performance of an inference procedure in classical statistics 
is based on expected quantities like bias or variance of an estimate, error 
probabilities for a test, and confidence coefficients of a confidence interval. 
Such measures are obtained by integrating or summing over the sample space 
of all possible data. Hence they do not answer how good the inference is for 
a particular data set. The following two examples show how irrelevant the 
classical answers can be once the data are in hand. 


Example 2.5. (Cox (1958)) To estimate $\mu$ in $N(\mu, \sigma^2)$, toss a fair coin. Have
a sample of size $n = 2$ if it is a head and take $n = 1000$ if it is a tail. An
unbiased estimate of $\mu$ is $\bar{X}_n = \sum_{i=1}^{n} X_i/n$ with variance
$= \frac{1}{2}\{\frac{\sigma^2}{2} + \frac{\sigma^2}{1000}\} \approx \frac{\sigma^2}{4}$.
Suppose it was a tail. Would you believe $\sigma^2/4$ is a measure of accuracy of the
estimate?
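
The gap between the two measures of accuracy in Cox's example is a matter of simple arithmetic. The short Python sketch below (our own illustration, with $\sigma^2 = 1$ assumed for concreteness) puts the unconditional variance next to the conditional variance given that a tail was observed.

```python
def cox_variances(sigma2, n_head=2, n_tail=1000):
    """Unconditional variance of the sample mean (averaged over the fair coin)
    and the conditional variance given that the coin showed a tail."""
    unconditional = 0.5 * (sigma2 / n_head + sigma2 / n_tail)
    conditional_tail = sigma2 / n_tail
    return unconditional, conditional_tail


# sigma2 = 1 is an assumed value used only for illustration.
unconditional, conditional_tail = cox_variances(1.0)
print(unconditional, conditional_tail, unconditional / conditional_tail)
```

With $\sigma^2 = 1$ the unconditional variance is about $1/4$, roughly 250 times the variance that actually applies once $n = 1000$ observations are in hand.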


Example 2.6. (Welch (1939)) Let $X_1, X_2$ be i.i.d. $U(\theta - \frac{1}{2}, \theta + \frac{1}{2})$. Let $\bar{X} \pm C$ be
a 95% confidence interval, $C > 0$ being suitably chosen. Suppose $X_1 = 2$ and
$X_2 = 1$. Then we know for sure $\theta = (X_1 + X_2)/2$ and hence $\theta \in (\bar{X} - C, \bar{X} + C)$.
Should we still claim we have only 95% confidence that the confidence interval
covers $\theta$?


One of us (Ghosh) learned of this example from a seminar of D. Basu 
at the University of Illinois, Urbana-Champaign, in 1965. Basu pointed out 
how paradoxical is the confidence coefficient in this example. This perspective 
doesn’t seem to be stressed in Welch (1939). The example has been discussed 
many times, see Lehmann (1986, Chapter 10, Problems 27 and 28), Pratt 
(1961), Kiefer (1977), Berger and Wolpert (1988), and Chatterjee and Chat- 
topadhyay (1994). 

Fisher was aware of this phenomenon and suggested we could make
inference conditional on a suitable ancillary statistic. In Cox's example
(Example 2.5), it would be appropriate to condition on the sample size and quote the
conditional variance given $n = 1000$ as a proper measure of accuracy. Note
that $n$ is ancillary: its distribution is free of $\theta$, so conditioning on it doesn't
change the likelihood. In Welch's example (Example 2.6), we could give the
conditional probability of covering $\theta$, conditional on $X_1 - X_2 = 1$. Note that
$X_1 - X_2$ is also an ancillary statistic like $n$ in Cox's example. It contains no
information about $\theta$, so fixing it would not change the likelihood, but its
value, like the value of $n$, gives us some idea about how much information there
is in the data. You are asked to carry out Fisher's suggestion in Problem 4.
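
As a hedged illustration of what conditioning does in Welch's example, the sketch below uses two facts that we derive ourselves rather than take from the text: unconditionally $P(|\bar{X} - \theta| \le C) = 1 - (1 - 2C)^2$, and given $|X_1 - X_2| = d$ the quantity $\bar{X} - \theta$ is uniform on $(-(1 - d)/2, (1 - d)/2)$. The function names are ours.

```python
from math import sqrt


def half_width(confidence=0.95):
    """C such that the interval (mean - C, mean + C) has the stated unconditional
    coverage for two i.i.d. U(theta - 1/2, theta + 1/2) observations."""
    return (1.0 - sqrt(1.0 - confidence)) / 2.0


def conditional_coverage(d, c):
    """Coverage of (mean - c, mean + c) given |X1 - X2| = d, using the fact that
    mean - theta is then uniform on (-(1 - d)/2, (1 - d)/2)."""
    d = abs(d)
    if d >= 1.0:
        return 1.0  # theta is determined exactly by the data
    return min(1.0, 2.0 * c / (1.0 - d))


c = half_width(0.95)
for d in (0.0, 0.1, 0.5, 1.0):
    print(d, round(conditional_coverage(d, c), 4))
```

The conditional coverage ranges from well below 95% when the two observations nearly coincide to exactly 1 when they are one unit apart, which is the point of the example.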

Suppose you are a classical statistician and faced with this example you are 
ready to make conditional inference as recommended by Fisher. Unfortunately, 
there is a catch. Classical statistics also recommends that inference be based on 
minimal sufficient statistics. These two principles, namely the conditionality 
principle (CP) and sufficiency principle (SP) together have a far reaching 
implication. Birnbaum (1962) proved that they imply one must then follow 
the likelihood principle (LP), which requires that inference be based on the 
likelihood alone, ignoring the sample space. A precise statement and proof are 
given in Appendix B. 

Bayesian analysis satisfies the likelihood principle since the posterior de- 
pends on the data only through the likelihood. Most classical inference pro- 
cedures violate the likelihood principle. 

Closely related to the violation of LP is the stopping rule paradox in 
classical inference. There is a hilarious example due to Pratt (Berger, 1985a, 
pp. 30-31). 


2.5 Elements of Bayesian Decision Theory 


We can approach problems of inference in a mathematically more formal way 
through statistical decision theory. This would make the problems somewhat 
abstract and divorced from the real-life connotations but, on the other hand, 
provides a unified conceptual framework for handling very diverse problems. 

A classical statistical decision problem, vide Section 1.5, has the following
ingredients. It has as data the observed value of $X$, the density $f(x|\theta)$ where
the parameter $\theta$ lies in some subset $\Theta$ (known as the parameter space) of the
$p$-dimensional Euclidean space $R^p$. It also has a space $\mathcal{A}$ of actions or decisions
$a$ and a loss function $L(\theta, a)$ which is the loss incurred when the parameter
is $\theta$ and the action taken is $a$. The loss function is assumed to be bounded
below so that integrals that appear later are well-defined. Typically, $L(\theta, a)$
will be $\ge 0$ for all $\theta$ and $a$. We treat actions and decisions as essentially the
same in this framework though in non-statistical decision problems there will
be some conceptual difference between a decision and the action it leads to.
Finally it has a collection of decision functions or rules $\delta(x)$ that take values
in $\mathcal{A}$. Suppose $\delta(x) = a$ for given $x$. Then the statistician who follows this
particular rule $\delta(x)$ will choose action $a$ given this particular data and incur
the loss $L(\theta, a)$.
Both estimation and testing are special cases. Suppose the object is to
estimate $\tau(\theta)$, a real-valued function of $\theta$. Then $\mathcal{A} = R$, $L(\theta, a) = (a - \tau(\theta))^2$
and a decision function $\delta(x)$ is an estimate of $\tau(\theta)$. If it is a problem of testing
$H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$, say, then $\mathcal{A} = \{a_0, a_1\}$ where $a_j$ means the
decision to accept $H_j$, $L(\theta, a_j) = 0$ if $\theta$ satisfies $H_j$ and $L(\theta, a_j) = 1$ otherwise.
If $I(x)$ is the indicator of a rejection region for $H_0$, then the corresponding
$\delta(x)$ is equal to $a_j$ if $I(x) = j$, $j = 0, 1$.

We recall also how one evaluates the performance of $\delta(x)$ in classical
statistics through the average loss or risk function

$$R(\theta, \delta) = E_\theta(L(\theta, \delta(X))).$$

If $\delta$ is an estimate of $\tau(\theta)$ in an estimation problem, then
$R(\theta, \delta) = E_\theta(\tau(\theta) - \delta(X))^2$ is the MSE (mean squared error). If $\delta$ is the
indicator function of an $H_0$-rejection region, then $R(\theta, \delta)$ is the probability of
error of first kind if $\theta \in \Theta_0$ and probability of error of second kind if $\theta \in \Theta_1$.

For a Bayesian, $\theta$ is a random variable with prior distribution $\pi(\theta)$ before
seeing the data, for example, at the planning stage of an experiment. The
relevant risk at this stage is the so-called preposterior risk

$$\int_{\Theta} R(\theta, \delta)\pi(\theta)\, d\theta = R(\pi, \delta).$$

It depends on $\delta$ and the prior. On the other hand, after the data are in hand,
the relevant distribution of $\theta$ is given by the posterior density $\pi(\theta|x)$ and the
relevant risk is the posterior risk

$$E(L(\theta, a)|x) = \psi(x, a).$$

The posterior risk associated with $\delta$ is $\psi(x, \delta(x))$. So, in principle, there are
two Bayesian decision problems.

A. Given $X = x$, choose an optimal $a$, i.e., choose an $a$ to minimize $\psi(x, a)$.
B. At the planning stage, choose an optimal $\delta(X)$, denoted as $\delta_\pi$ and called
the Bayes decision rule or simply the Bayes rule, to minimize $R(\pi, \delta)$.


We have the following pleasant fact, which shows in a sense both problems 
give the same answer for a given X. 


Theorem 2.7. (a) For any $\delta$,

$$R(\pi, \delta) = E(\psi(X, \delta(X))).$$

(b) Suppose $a(x)$ minimizes $\psi(x, a)$, i.e.,

$$\psi(x, a(x)) = \inf_a \psi(x, a).$$

Then the decision function $a(x)$ minimizes $R(\pi, \delta)$.

Proof. (a) Because $E(L(\theta, \delta(X))|X) = \psi(X, \delta(X))$ by definition of $\psi$, the
result follows by taking expectations on both sides.

(b) Let $a(x)$, as defined in the theorem, be denoted by $\delta_0$. Then, by part (a),
and definition of $a(x)$,

$$R(\pi, \delta_0) = E(\psi(X, a(X))) \le E(\psi(X, \delta(X))) \ \text{for any } \delta, \ = R(\pi, \delta),$$

so that $R(\pi, \delta_0) = \inf_\delta R(\pi, \delta)$, as claimed. □
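
Problem A above can be carried out by brute force when the posterior is discrete. The following Python sketch is only an illustration under assumptions of our own: a hypothetical four-point posterior, squared error loss, and a grid of candidate actions. It computes the posterior risk $\psi(x, a)$ for each action and picks the minimizer.

```python
def posterior_risk(action, thetas, post_probs, loss):
    """psi(x, a) = E(L(theta, a) | x) for a discrete posterior."""
    return sum(p * loss(t, action) for t, p in zip(thetas, post_probs))


def bayes_action(actions, thetas, post_probs, loss):
    """Action minimizing the posterior risk over a finite set of candidates."""
    return min(actions, key=lambda a: posterior_risk(a, thetas, post_probs, loss))


def squared_error(theta, a):
    return (theta - a) ** 2


# Hypothetical posterior on four parameter values, given some observed x.
thetas = [0.0, 1.0, 2.0, 3.0]
post_probs = [0.45, 0.15, 0.10, 0.30]
actions = [i / 100 for i in range(301)]  # candidate actions 0.00, 0.01, ..., 3.00

best = bayes_action(actions, thetas, post_probs, squared_error)
post_mean = sum(t * p for t, p in zip(thetas, post_probs))
print(best, post_mean)
```

Under squared error loss the minimizing action agrees with the posterior mean, and by Theorem 2.7(b) doing this for every $x$ yields a rule that also minimizes the preposterior risk.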


This fact will be used below in the sections on estimation, interval estima- 
tion and testing. 


2.6 Improper Priors 


For point and interval estimates and to some extent in testing, objective priors
are often improper. We have considered an improper prior for $\mu$ in $N(\mu, \sigma^2)$
earlier in Examples 2.1 and 2.4 but somewhat indirectly. Also, one of the
Beta priors in Example 2.2 was improper. We discuss a few basic facts about
improper priors. We follow Berger (1985a).

An improper prior density $\pi(\theta)$ is non-negative for all $\theta$ but

$$\int_{\Theta} \pi(\theta)\, d\theta = \infty.$$

Such an improper prior can be used in the Bayes formula for calculating the
posterior, provided the denominator is finite for all $x$ (or all but a set of $x$
with zero probability for all $\theta$), i.e.,

$$\int_{\Theta} \pi(\theta) f(x|\theta)\, d\theta < \infty.$$

Then the posterior density $\pi(\theta|X = x)$ is a proper probability density
function and can be used at least in inference problems or the posterior decision
problem where we define and minimize $\psi(x, a)$. However, for improper priors
usually $R(\pi, \delta)$ is not used.

The most common improper priors are

$$\pi_1(\mu) = C, \quad -\infty < \mu < \infty,$$

$$\pi_2(\sigma) = \frac{1}{\sigma}, \quad 0 < \sigma < \infty,$$

for location and scale parameters. Both the improper priors may be
interpreted as a sort of limit of the proper priors:

$$\pi_{1,L}(\mu) = \begin{cases} 1/(2L) & \text{if } -L < \mu < L; \\ 0 & \text{otherwise,} \end{cases}$$

$$\pi_{2,L}(\sigma) = \begin{cases} A/\sigma & \text{if } 0 < 1/L < \sigma < L; \\ 0 & \text{otherwise,} \end{cases}$$

where $A = 1/(2 \log L)$, in the sense that the posteriors for $\pi_1$ and $\pi_2$ may be
obtained by making $L \to \infty$ in $\pi_{i,L}(\theta|X)$. Also, as pointed out by Heath and
Sudderth (1978), the posteriors for $\pi_i$ are the same as the posteriors for suitably
chosen proper but finitely additive priors.
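
The limiting interpretation can be seen numerically for the location case. In the sketch below, which rests on our own assumptions (normal data with known $\sigma$ and hypothetical values $\bar{x} = 1.5$, $n = 10$, $\sigma = 1$), the posterior under $\pi_{1,L}$ is the $N(\bar{x}, \sigma^2/n)$ density truncated to $(-L, L)$, and its mean is computed from the standard truncated-normal formula.

```python
from math import sqrt
from statistics import NormalDist


def posterior_mean_uniform_prior(xbar, n, sigma, L):
    """Posterior mean of mu under the proper prior uniform on (-L, L):
    the mean of N(xbar, sigma^2 / n) truncated to (-L, L)."""
    s = sigma / sqrt(n)
    std = NormalDist()
    a = (-L - xbar) / s
    b = (L - xbar) / s
    mass = std.cdf(b) - std.cdf(a)  # posterior normalizing constant
    return xbar + s * (std.pdf(a) - std.pdf(b)) / mass


# Hypothetical data summary: xbar = 1.5 from n = 10 observations with sigma = 1.
for L in (1.0, 2.0, 5.0, 50.0):
    print(L, posterior_mean_uniform_prior(1.5, 10, 1.0, L))
```

As $L$ grows, the posterior mean approaches $\bar{x}$, the posterior mean under the improper prior $\pi_1$.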


2.7 Common Problems of Bayesian Inference 


There are three common problems, as in classical statistics, namely, point 
estimation, interval estimation, and testing. We have already seen examples of 
point estimates and tests of one-sided hypotheses, so we begin with these two 
problems and then turn to interval estimates (credible intervals) and testing 
of a sharp null hypothesis. Testing a sharp null hypothesis will be illustrated 
with a popular Bayes test for the normal mean due to Jeffreys. We also discuss 
prediction and a few other topics related to testing and interval estimation. 

Because the difference between Bayesian inference and Bayesian decision
theory is mainly one of nuances, we do not make any sharp distinctions
between the two approaches. So our treatment of these three problems as well
as other problems later includes elements of both: loss functions from
decision theory as well as evidential descriptive measures from inference. A full
Bayesian study of a problem consists of two stages, the planning or
preposterior stage followed by posterior Bayesian analysis of data collected. At the
planning stage one would have problems of choosing optimum design and
optimum sample size. Then the integrated Bayes risk $R(\pi) = \inf_\delta R(\pi, \delta)$ plays
a central role.

In this book we concentrate on the posterior Bayes analysis of data. 





2.7.1 Point Estimates 


For a real valued $\theta$, standard Bayes estimates are the posterior mean or the
posterior median. The posterior mean is the Bayes estimate corresponding
with squared error loss and the posterior median is the Bayes estimate for
absolute deviation loss. Along with the posterior mean one reports the
posterior variance or its square root, the posterior standard deviation of $\theta$. If one
chooses to work with the posterior median, it would be convenient to report a
couple of other posterior quantiles to give an idea of the posterior variability
of $\theta$. One could report at least the first and third posterior quartiles.

If the posterior is unimodal then the posterior mode is another choice. It is 
similar to the MLE of classical statistics. Indeed if the prior is uniform, both 
are identical. Along with the posterior mode one can report a suitable highest
posterior density (HPD) credible interval as a measure of posterior variability. 
If the parameter is a vector, common choices for reporting are the posterior 
mean vector and the posterior dispersion matrix. Again if the posterior is 
unimodal, one can report the posterior mode with a suitable HPD credible 
set. Problem 14 illustrates this with a multivariate normal model with known 
dispersion matrix and a multivariate normal or uniform prior for the normal 
mean vector. 
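
The point estimates discussed in this subsection can be approximated on a grid when no closed form is at hand. The sketch below is an illustration under assumptions of our own: it takes the Beta posterior of Example 2.2 with a uniform prior and hypothetical data $r = 3$, $n = 10$, and uses a midpoint grid of 10000 cells.

```python
def grid_posterior(r, n, alpha, beta, m=10000):
    """Midpoint grid on (0, 1) and normalized posterior weights proportional to
    p^(alpha + r - 1) * (1 - p)^(beta + n - r - 1)."""
    grid = [(i + 0.5) / m for i in range(m)]
    weights = [p ** (alpha + r - 1) * (1 - p) ** (beta + n - r - 1) for p in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]


def quantile(grid, probs, q):
    """Smallest grid point whose cumulative posterior probability reaches q."""
    cumulative = 0.0
    for p, w in zip(grid, probs):
        cumulative += w
        if cumulative >= q:
            return p
    return grid[-1]


# Hypothetical data: r = 3 successes in n = 10 trials, uniform prior.
grid, probs = grid_posterior(3, 10, 1.0, 1.0)
post_mean = sum(p * w for p, w in zip(grid, probs))
post_mode = max(zip(grid, probs), key=lambda pair: pair[1])[0]
post_median = quantile(grid, probs, 0.5)
quartiles = (quantile(grid, probs, 0.25), quantile(grid, probs, 0.75))
print(post_mean, post_median, post_mode, quartiles)
```

Here the mode agrees with the MLE $r/n$ up to grid resolution, as expected under a uniform prior, and the quartiles give a quick idea of posterior variability around the median.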


2.7.2 Testing 
We want to test

$$H_0 : \theta \in \Theta_0 \ \text{versus} \ H_1 : \theta \in \Theta_1. \qquad (2.7)$$

If $\Theta_0$ and $\Theta_1$ are of the same dimension as for one-sided null and alternative
hypotheses, it is convenient and easy to choose a prior density that assigns
positive prior probability to $\Theta_0$ and $\Theta_1$. One then calculates the posterior
probabilities $P\{\Theta_i|x\}$ as well as the posterior odds ratio (or simply posterior
odds), namely,

$$P\{\Theta_0|x\}/P\{\Theta_1|x\}$$

that most people prefer. One would then find a threshold like 1/9 or 1/19,
etc. to decide what constitutes evidence against $H_0$. The Bayes rule for 0-1
loss is to choose the hypothesis with higher posterior probability.

There is a conceptual problem with this approach. If the prior is improper, 
then the prior probabilities may be undefined — they are, strictly speaking, 
undefined in the example with one-sided null and alternatives. Even if the 
prior is proper, the prior probabilities assigned to 0;, i.e., P(Q;) may not be 
carefully chosen and so may not be satisfactory. Surely, if our attitude to Ho 
is still as in classical Statistics, namely, that it should not be rejected unless 
there is compelling evidence to the contrary, then it would be unreasonable to 
assign less prior probability to Op than O,. In fact an objective or impartial 
choice would be to assign equal probabilities. These things can be done better 
if we use the following alternative way of specifying the prior. 

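Let $\pi_0$ and $1 - \pi_0$ be the prior probabilities of $\Theta_0$ and $\Theta_1$. Let $g_i(\theta)$ be
the prior p.d.f. of $\theta$ under $\Theta_i$, so that


$$\int_{\Theta_i} g_i(\theta)\, d\theta = 1.$$


The prior in the previous approach is nothing but
$$\pi(\theta) = \pi_0\, g_0(\theta)\, I\{\theta \in \Theta_0\} + (1 - \pi_0)\, g_1(\theta)\, I\{\theta \in \Theta_1\}. \tag{2.8}$$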


We do not require any longer that $\Theta_0$ and $\Theta_1$ are of the same dimension. So
in principle, sharp null hypotheses are also covered. We can now proceed as
before and report posterior probabilities or posterior odds. To compute these
posterior quantities, note that the marginal density of $X$ under the prior $\pi$
can be expressed as


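$$m_\pi(x) = \int_{\Theta} f(x \mid \theta)\, \pi(\theta)\, d\theta
= \pi_0 \int_{\Theta_0} f(x \mid \theta)\, g_0(\theta)\, d\theta
+ (1 - \pi_0) \int_{\Theta_1} f(x \mid \theta)\, g_1(\theta)\, d\theta, \tag{2.9}$$


and hence the posterior density of $\theta$ given the data $X = x$ as


$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta)}{m_\pi(x)} =
\begin{cases}
\pi_0\, f(x \mid \theta)\, g_0(\theta) / m_\pi(x) & \text{if } \theta \in \Theta_0; \\
(1 - \pi_0)\, f(x \mid \theta)\, g_1(\theta) / m_\pi(x) & \text{if } \theta \in \Theta_1.
\end{cases} \tag{2.10}$$

It follows then that

$$P^\pi(H_0 \mid x) = P^\pi(\Theta_0 \mid x)
= \frac{\pi_0}{m_\pi(x)} \int_{\Theta_0} f(x \mid \theta)\, g_0(\theta)\, d\theta
= \frac{\pi_0 \int_{\Theta_0} f(x \mid \theta)\, g_0(\theta)\, d\theta}
{\pi_0 \int_{\Theta_0} f(x \mid \theta)\, g_0(\theta)\, d\theta + (1 - \pi_0) \int_{\Theta_1} f(x \mid \theta)\, g_1(\theta)\, d\theta}$$

and

$$P^\pi(H_1 \mid x) = P^\pi(\Theta_1 \mid x)
= \frac{1 - \pi_0}{m_\pi(x)} \int_{\Theta_1} f(x \mid \theta)\, g_1(\theta)\, d\theta
= \frac{(1 - \pi_0) \int_{\Theta_1} f(x \mid \theta)\, g_1(\theta)\, d\theta}
{\pi_0 \int_{\Theta_0} f(x \mid \theta)\, g_0(\theta)\, d\theta + (1 - \pi_0) \int_{\Theta_1} f(x \mid \theta)\, g_1(\theta)\, d\theta}.$$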





One may also report the Bayes factor, which does not depend on $\pi_0$. The
Bayes factor of $H_0$ relative to $H_1$ is defined as


$$BF_{01} = \frac{\int_{\Theta_0} f(x \mid \theta)\, g_0(\theta)\, d\theta}
{\int_{\Theta_1} f(x \mid \theta)\, g_1(\theta)\, d\theta}. \tag{2.11}$$


Clearly, $BF_{10} = 1/BF_{01}$. The posterior odds ratio of $H_0$ relative to $H_1$ is


$$\left(\frac{\pi_0}{1 - \pi_0}\right) BF_{01},$$


which reduces to $BF_{01}$ if $\pi_0 = \frac{1}{2}$. Thus, $BF_{01}$ is an important evidential
measure that is free of $\pi_0$. The smaller the value of $BF_{01}$, the stronger the
evidence against $H_0$.

Let us consider an example to illustrate some of these measures. It will be
extended to include the well-known Jeffreys analysis later.





Example 2.8. Consider a blood test conducted for determining the sugar level
of a person with diabetes two hours after he had his breakfast. It is of interest
to see if his medication has controlled his blood sugar levels. Assume that
the test result $X$ is $N(\theta, 100)$, where $\theta$ is the true level. In the appropriate
population (diabetic but under this treatment), $\theta$ is distributed according to
a $N(100, 900)$. Then, marginally $X$ is $N(100, 1000)$, and the posterior
distribution of $\theta$ given $X = x$ is normal with

$$\text{mean} = \frac{900}{1000}\, x + \frac{100}{1000}\, 100 = 0.9x + 10
\quad \text{and} \quad
\text{variance} = \frac{100 \times 900}{1000} = 90.$$

Suppose we want to test $H_0 : \theta \le 130$ versus $H_1 : \theta > 130$. If the blood test
shows a sugar level of 130, what can be concluded? Note that, given this test
result, the true mean blood sugar level ($\theta$) may be assumed to be $N(127, 90)$.
Consequently, we obtain

$$P(\theta \le 130 \mid X = 130) = \Phi\!\left(\frac{130 - 127}{\sqrt{90}}\right) = \Phi(0.316) = 0.624,$$

and hence $P(\theta > 130 \mid X = 130) = 0.376$. Therefore,

$$\text{Posterior odds ratio} = 0.624/0.376 = 1.66.$$

Because $\pi_0 = P^\pi(\theta \le 130) = \Phi\!\left(\frac{130 - 100}{30}\right) = \Phi(1)$, the prior odds ratio is
$\Phi(1)/(1 - \Phi(1)) = 0.8413/0.1587 = 5.3$, and thus the Bayes factor turns out to
be $1.66/5.3 = 0.313$.
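
The arithmetic of Example 2.8 is easy to reproduce. The following is a minimal sketch using only the Python standard library; the variable names are ours, and the numbers are those of the example.

```python
from math import sqrt
from statistics import NormalDist

# Example 2.8: X | theta ~ N(theta, 100) and theta ~ N(100, 900).
sigma2, prior_mean, prior_var = 100.0, 100.0, 900.0
x, cutoff = 130.0, 130.0

# Conjugate normal update for the mean.
w = prior_var / (prior_var + sigma2)                   # weight on x, here 0.9
post_mean = w * x + (1 - w) * prior_mean               # 0.9x + 10 = 127
post_var = prior_var * sigma2 / (prior_var + sigma2)   # 90

# Probability of H0: theta <= 130 under the posterior and under the prior.
post_p0 = NormalDist(post_mean, sqrt(post_var)).cdf(cutoff)
prior_p0 = NormalDist(prior_mean, sqrt(prior_var)).cdf(cutoff)   # Phi(1)

post_odds = post_p0 / (1 - post_p0)
prior_odds = prior_p0 / (1 - prior_p0)
bayes_factor = post_odds / prior_odds   # posterior odds divided by prior odds

print(f"P(H0 | x) = {post_p0:.3f}, posterior odds = {post_odds:.2f}")
print(f"prior odds = {prior_odds:.2f}, BF01 = {bayes_factor:.3f}")
```

Computing the Bayes factor as posterior odds over prior odds mirrors the text; with a single continuous prior there is no need to construct $g_0$ and $g_1$ explicitly.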

It can also be noted here that in one-sided testing situations when a continuous
prior $\pi$ can be specified readily for the entire parameter space, there is no
need to express it in the form $\pi(\theta) = \pi_0 g_0(\theta) I\{\theta \in \Theta_0\} + (1 - \pi_0) g_1(\theta) I\{\theta \in
\Theta_1\}$. However, the problem of testing a point null hypothesis turns out to be
quite different as shown below. 


Testing a Point Null Hypothesis 
The problem is to test 
$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \ne \theta_0. \tag{2.12}$$


Consider the following examples, which indicate when we need to consider 
point nulls and when we need not. 


Example 2.9. In a statistical quality control situation, $\theta$ is the size of a unit
and acceptable units are those with $\theta \in (\theta_0 - \delta, \theta_0 + \delta)$. Then one would like to test


$$H_0 : |\theta - \theta_0| \le \delta.$$


In this problem the length of the interval, $2\delta$, can be explicitly specified. On 
the other hand, this is not the case in the following. 


Example 2.10. (i) Suppose we want to test the hypothesis, 
$H_0$: Vitamin C has no effect on the common cold.


Clearly this is not meant to be thought of as an exact point null; surely 
vitamin C has some effect, though perhaps a very minuscule effect. Thus, in 
reality, this is still the case of an interval null hypothesis, with a very small 
unspecified interval. However, it would be better represented as a point null 
hypothesis. 
(ii) On the other hand, a hypothesis such as 


$H_0$: Astrology cannot predict the future


can perhaps be represented as an exact point null. 


Since these issues are important, we summarize the main points below. 
If the interval in an interval null hypothesis, along with $\pi_0$, $g_0$, and $g_1$, can
be specified, it is best to treat the problem as an interval null hypothesis
problem and proceed accordingly. However, when the interval around $\theta_0$ is
small but unspecified, and $g_0$ is difficult to specify, it is best to approximate the
interval null by a point null. Conceptually testing a point null is not a different
problem, but there are complications. First of all, it is not possible to use a
continuous prior density because any such prior will necessarily assign prior
probability zero to the null hypothesis. Consequently, the posterior probability
of the null hypothesis will also be zero. Intuitively, this is clear: if the null
hypothesis is a priori impossible, it will remain so a posteriori also. Therefore,
a prior probability of $\pi_0 > 0$ needs to be assigned to the point $\theta_0$ and the
remaining probability of $\pi_1 = 1 - \pi_0$ will be spread over $\{\theta \ne \theta_0\}$ using a
density $g_1$. Simply take $g_0$ to be a point mass at $\theta_0$ in (2.8). If the point null
hypothesis approximates an interval null hypothesis, $H_0 : \theta \in (\theta_0 - \epsilon, \theta_0 + \epsilon)$,
then $\pi_0$ is the probability assigned to the interval $(\theta_0 - \epsilon, \theta_0 + \epsilon)$ by a continuous
prior. The complication now is that the prior $\pi$ is of the form


$$\pi(\theta) = \pi_0\, I\{\theta = \theta_0\} + (1 - \pi_0)\, g_1(\theta)\, I\{\theta \ne \theta_0\} \tag{2.13}$$


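and hence has both discrete and continuous parts. However, (2.9) and (2.10)
yield


$$m_\pi(x) = \pi_0\, f(x \mid \theta_0) + (1 - \pi_0)\, m_1(x), \tag{2.14}$$


where
$$m_1(x) = \int_{\theta \ne \theta_0} f(x \mid \theta)\, g_1(\theta)\, d\theta.$$
Therefore, from (2.10),
$$\pi(\theta_0 \mid x) = \frac{f(x \mid \theta_0)\, \pi_0}{m_\pi(x)}
= \frac{\pi_0\, f(x \mid \theta_0)}{\pi_0\, f(x \mid \theta_0) + (1 - \pi_0)\, m_1(x)}
= \left\{ 1 + \frac{1 - \pi_0}{\pi_0}\, \frac{m_1(x)}{f(x \mid \theta_0)} \right\}^{-1}. \tag{2.15}$$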

It follows then that the posterior odds ratio is given by


$$\frac{\pi(\theta_0 \mid x)}{1 - \pi(\theta_0 \mid x)} = \frac{\pi_0}{1 - \pi_0}\, \frac{f(x \mid \theta_0)}{m_1(x)},$$


and hence the Bayes factor of $H_0$ relative to $H_1$ (which is the ratio of the
above posterior odds ratio to the prior odds ratio of $\pi_0/(1 - \pi_0)$) is


$$B = B(x) = BF_{01}(x) = \frac{f(x \mid \theta_0)}{m_1(x)}. \tag{2.16}$$


Thus, (2.15) can be expressed as


$$\pi(\theta_0 \mid x) = \left\{ 1 + \frac{1 - \pi_0}{\pi_0}\, BF_{01}^{-1}(x) \right\}^{-1}. \tag{2.17}$$
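
Relation (2.17) shows how strongly the posterior probability of a point null depends on $\pi_0$ for a fixed Bayes factor. A small sketch, with a function name of our own choosing and an arbitrary illustrative Bayes factor of 1/3:

```python
def posterior_null_probability(bf01: float, pi0: float = 0.5) -> float:
    """Posterior probability of H0 from BF01 and the prior mass pi0, as in (2.17)."""
    if not 0.0 < pi0 < 1.0:
        raise ValueError("pi0 must lie strictly between 0 and 1")
    if bf01 <= 0.0:
        raise ValueError("bf01 must be positive")
    prior_odds_against = (1.0 - pi0) / pi0
    return 1.0 / (1.0 + prior_odds_against / bf01)


# Same evidence (BF01 = 1/3), three different prior masses on the null.
for pi0 in (0.1, 0.5, 0.9):
    print(pi0, round(posterior_null_probability(1 / 3, pi0), 4))
```

With $\pi_0 = 1/2$ the result is $BF_{01}/(1 + BF_{01})$, so the posterior probability and the Bayes factor carry the same information in that case.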


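Example 2.11. Suppose $X \sim B(n, \theta)$ and we want to test $H_0 : \theta = \theta_0$ versus
$H_1 : \theta \ne \theta_0$, a problem similar to checking whether a given coin is biased
based on $n$ independent tosses (where $\theta_0$ will be taken to be 0.5). Under the
alternative hypothesis, suppose $\theta$ is distributed as Beta$(\alpha, \beta)$. Then $m_1(x)$ is
given by

$$m_1(x) = \binom{n}{x} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,
\frac{\Gamma(\alpha + x)\, \Gamma(\beta + n - x)}{\Gamma(\alpha + \beta + n)},$$

so that

$$BF_{01}(x) = \binom{n}{x} \theta_0^{x} (1 - \theta_0)^{n - x} \Big/
\left( \binom{n}{x} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,
\frac{\Gamma(\alpha + x)\, \Gamma(\beta + n - x)}{\Gamma(\alpha + \beta + n)} \right)
= \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}\,
\frac{\Gamma(\alpha + \beta + n)}{\Gamma(\alpha + x)\, \Gamma(\beta + n - x)}\,
\theta_0^{x} (1 - \theta_0)^{n - x}.$$


Hence, we obtain

$$\pi(\theta_0 \mid x) = \left\{ 1 + \frac{1 - \pi_0}{\pi_0}\, BF_{01}^{-1}(x) \right\}^{-1}
= \left\{ 1 + \frac{1 - \pi_0}{\pi_0}\,
\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,
\frac{\Gamma(\alpha + x)\, \Gamma(\beta + n - x)}{\Gamma(\alpha + \beta + n)}\,
\theta_0^{-x} (1 - \theta_0)^{x - n} \right\}^{-1}.$$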
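
The Bayes factor of Example 2.11 is a ratio of Gamma functions and is best evaluated on the log scale. The sketch below is one way to do this; the function names are ours, and the data (10 tosses with 5 or 9 heads, uniform Beta(1, 1) prior under $H_1$) are hypothetical.

```python
from math import exp, lgamma, log


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bf01_binomial(x: int, n: int, theta0: float = 0.5,
                  alpha: float = 1.0, beta: float = 1.0) -> float:
    """BF01 for H0: theta = theta0 against theta ~ Beta(alpha, beta).

    The binomial coefficient cancels between f(x | theta0) and m_1(x), leaving
    theta0^x (1 - theta0)^(n - x) * B(alpha, beta) / B(alpha + x, beta + n - x).
    """
    log_null = x * log(theta0) + (n - x) * log(1.0 - theta0)
    log_alt = log_beta(alpha + x, beta + n - x) - log_beta(alpha, beta)
    return exp(log_null - log_alt)


# Hypothetical coin-tossing data.
print(bf01_binomial(5, 10))   # balanced outcome, BF01 above 1
print(bf01_binomial(9, 10))   # lopsided outcome, BF01 well below 1
```

Under the uniform prior, $m_1(x) = 1/(n+1)$ for every $x$, which gives a quick hand check of the two values.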


Further discussion on hypothesis testing will be deferred to Chapter 6 
where basic aspects of model selection will also be considered. 


Jeffreys Test for Normal Mean with Unknown $\sigma^2$


Suppose the data consist of i.i.d. observations $X_1, X_2, \ldots, X_n$ from a normal
$N(\mu, \sigma^2)$ distribution. We want to test $H_0 : \mu = \mu_0$ versus $H_1 : \mu \ne \mu_0$, where
$\mu_0$ is some specified number. Without loss of generality, we assume $\mu_0 = 0$.
Note that the parameter $\sigma^2$ is common in the two models corresponding to $H_0$
and $H_1$ and $\mu$ occurs only in $H_1$. Also, in this example $\mu$ and $\sigma^2$ are orthogonal
parameters in the sense that the Fisher information matrix is diagonal.
In such situations, Jeffreys (1961) suggests using an (improper) objective prior
only for the common parameter and using a default proper prior for the other
parameter that occurs only in one model. Let us consider the following priors
in our example. We take the prior $g_0(\sigma) = 1/\sigma$ for $\sigma$ under $H_0$. Under $H_1$, we
take the same prior for $\sigma$ and add a conditional prior for $\mu$ given $\sigma$, namely


$$g_1(\mu \mid \sigma) = \frac{1}{\sigma}\, g_2\!\left(\frac{\mu}{\sigma}\right),$$

where $g_2(\cdot)$ is a p.d.f. An initial natural choice for $g_2$ is $N(0, c^2)$. Thus the prior
conditional variance of $\mu$ is calibrated with respect to $\sigma^2$ as recommended by
Jeffreys. Usually, one takes $c = 1$.

Jeffreys points out that one would expect the Bayes factor $BF_{01}$ to
tend to zero if $\bar{x} \to \infty$ and $s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2$ is bounded. He gives an
argument that implies that unless $g_2$ has no finite moments, this will not
happen. In particular, with $g_2$ normal, it can be verified (Problem 12)
directly that $BF_{01}$ doesn't tend to zero as above. Jeffreys suggested we should
take $g_2$ to be Cauchy. So the priors recommended by Jeffreys are


$$g_0(\sigma) = \frac{1}{\sigma} \quad \text{under } H_0$$

and

$$g_1(\mu, \sigma) = \frac{1}{\sigma}\, g_1(\mu \mid \sigma) = \frac{1}{\sigma} \cdot \frac{1}{\sigma \pi (1 + \mu^2/\sigma^2)} \quad \text{under } H_1.$$

One may now find the Bayes factor $BF_{01}$ using (2.11). Let the joint density
of $X_1, \ldots, X_n$ under the $N(\mu, \sigma^2)$ model be denoted by $f(x_1, \ldots, x_n \mid \mu, \sigma^2)$. Then
$BF_{01}$ is given by


$$BF_{01} = \frac{\int f(x_1, \ldots, x_n \mid 0, \sigma^2)\, g_0(\sigma)\, d\sigma}
{\int\!\!\int f(x_1, \ldots, x_n \mid \mu, \sigma^2)\, g_1(\mu, \sigma)\, d\mu\, d\sigma},$$


where $g_0(\sigma)$ and $g_1(\mu, \sigma)$ are as given above. The integral in the numerator of
$BF_{01}$ can be obtained in closed form. However, no closed form is available for
the denominator. To calculate this one can proceed as follows. The Cauchy
density $g_1(\mu \mid \sigma)$ can be written as a Gamma scale mixture of normals


$$g_1(\mu \mid \sigma) = \int_0^\infty \left(\frac{\tau}{2\pi\sigma^2}\right)^{1/2}
\exp\!\left(-\frac{\tau \mu^2}{2\sigma^2}\right)
\frac{1}{\sqrt{2\pi}}\, \tau^{-1/2} e^{-\tau/2}\, d\tau,$$


where $\tau$ is the mixing Gamma variable. Then to calculate the denominator
of $BF_{01}$, one can integrate over $\mu$ and $\sigma$ in closed form. Finally, one has a
one-dimensional integral over $\tau$ left.


Example 2.12. Einstein's theory of gravitation predicts that light is deflected
by gravitation and specifies the amount of deflection. Einstein predicted that
light of stars would deflect under gravitational pull of the sun on the nearby
stars, but the effect would be visible only during a total solar eclipse when
the deflection can be measured through apparent change in a star's position.
A famous experiment by a team led by British astrophysicist Eddington,
immediately after the First World War (see Gardner, 1997), led to acceptance
of Einstein's theory. Though many other better designed experiments have
confirmed Einstein's theory since then, Eddington's expedition remains
historically important. There are four observations, two collected in 1919 in
Eddington's expedition, and two more collected by other groups in 1922 and
1929. The observations are $x_1 = 1.98$, $x_2 = 1.61$, $x_3 = 1.18$, $x_4 = 2.24$ (all in
seconds as measures of angular deflection). Suppose they are normally
distributed around their predicted value $\mu$. Then $X_1, \ldots, X_4$ are independent
and identically distributed as $N(\mu, \sigma^2)$. Einstein's prediction is $\mu = 1.75$. We
will test $H_0 : \mu = 1.75$ versus $H_1 : \mu \ne 1.75$, where $\sigma^2$ is unknown.

If we use the conventional priors of Jeffreys to calculate the Bayes factor
$BF_{01}$ in this example, it turns out to be 2.98 (Problem 7). Thus the
calculations with the given data lend some support to Einstein's prediction. However,
the evidence in the data isn’t very strong. This particular experiment has not 
been repeated because of unavoidable experimental errors. There are now 
better confirmations of Einstein’s theory, vide Gardner (1997). 
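
The one-dimensional integral over $\tau$ described for the Jeffreys test can be evaluated numerically. The sketch below rests on our own working of the closed-form steps, not on a formula quoted from the text: writing $y_i = x_i - \mu_0$, $S = \sum (y_i - \bar{y})^2$, $A = \sum y_i^2$ and $B(\tau) = S + n\tau\bar{y}^2/(n + \tau)$, integrating out $\mu$ and $\sigma$ gives

$$BF_{01} = \left[ \int_0^\infty (n + \tau)^{-1/2} \left(\frac{A}{B(\tau)}\right)^{n/2} \frac{1}{\sqrt{2\pi}}\, e^{-\tau/2}\, d\tau \right]^{-1}.$$

The function name, the truncation point of the integral, and the use of Simpson's rule are illustrative assumptions.

```python
from math import exp, pi, sqrt


def jeffreys_bf01(data, mu0, upper=80.0, steps=8000):
    """BF01 for H0: mu = mu0 under Jeffreys' priors (1/sigma, Cauchy on mu/sigma).

    Uses composite Simpson's rule on [0, upper]; steps must be even and the
    data must not all be equal (so that the sum of squares is positive).
    """
    y = [v - mu0 for v in data]
    n = len(y)
    ybar = sum(y) / n
    ss = sum((v - ybar) ** 2 for v in y)   # S, sum of squares about the mean
    a = ss + n * ybar ** 2                 # A, sum of squares about mu0

    def integrand(tau):
        b = ss + n * tau * ybar ** 2 / (n + tau)
        return (n + tau) ** -0.5 * (a / b) ** (n / 2) * exp(-tau / 2) / sqrt(2 * pi)

    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return 1.0 / (total * h / 3)


# Deflection data of Example 2.12 and Einstein's predicted value.
deflections = [1.98, 1.61, 1.18, 2.24]
bf = jeffreys_bf01(deflections, mu0=1.75)
print(f"BF01 = {bf:.2f}")
```

For these data the result is close to 3, in line with the value quoted in the example. Because the sample mean nearly equals 1.75, the factor $(A/B(\tau))^{n/2}$ is almost 1 and the integral is driven by $(n + \tau)^{-1/2}$ alone.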


2.7.3 Credible Intervals 


Bayesian interval estimates for $\theta$ are similar to confidence intervals of classical
inference. They are called credible intervals or sets.


Definition 2.13. For $0 < \alpha < 1$, a $100(1 - \alpha)\%$ credible set for $\theta$ is a subset
$C \subset \Theta$ such that
$$P\{C \mid X = x\} = 1 - \alpha.$$


Usually $C$ is taken to be an interval. Let $\theta$ be a continuous random variable, and
let $\theta^{(1)}$, $\theta^{(2)}$ be the $100\alpha_1\%$ and $100(1 - \alpha_2)\%$ quantiles with $\alpha_1 + \alpha_2 = \alpha$. Let $C =
[\theta^{(1)}, \theta^{(2)}]$. Then $P(C \mid X = x) = 1 - \alpha$. Usually equal tailed intervals are chosen
so $\alpha_1 = \alpha_2 = \alpha/2$.

If $\theta$ is discrete, usually it would be difficult to find an interval with exact
posterior probability $1 - \alpha$. There the condition is relaxed to


$$P(C \mid X = x) \ge 1 - \alpha$$


with the inequality being as close to an equality as possible. In general, one
may use a conservative inequality like this in the continuous case also if exact
posterior probability $1 - \alpha$ is difficult to attain.

Whereas the (frequentist) confidence statements do not apply to whether 
a given interval for a given $x$ covers the “true” $\theta$, this is not the case with
credible intervals. The credibility $1 - \alpha$ of a credible set does answer a layman's
question on whether the given set covers the “true” $\theta$ with probability $1 - \alpha$.
This is because in the Bayesian approach, “true” $\theta$ is a random variable with
a data dependent probability distribution, namely, the posterior distribution. 

For arbitrary priors, these probabilities will usually not have any frequency 
interpretation over repetitions like confidence statements. But for common 
objective priors, such statements are usually approximately true because of the 
normal approximation to the posterior distribution (see Chapter 4). Moreover, 
the approximations are surprisingly accurate for the Jeffreys prior. You are 
invited to verify this in Problem 8. Some explanation of this comes from the 
discussion of probability matching priors (Chapter 5). 

The equal tailed credible interval need not have the smallest size, namely, 
length or area or volume whichever is appropriate. For that one needs an HPD 
(Highest Posterior Density) interval. 


Definition 2.14. Suppose the posterior density for $\theta$ is unimodal. Then the
HPD interval for $\theta$ is the interval


$$C = \{\theta : \pi(\theta \mid X = x) \ge k\},$$
where $k$ is chosen such that
$$P(C \mid X = x) = 1 - \alpha.$$
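
When the posterior has no convenient closed form for the HPD interval, a grid approximation of Definition 2.14 is often enough. The sketch below is an illustration under our own assumptions: a hypothetical Beta(3, 9) posterior, a grid of 20000 cells, and function names that are ours. Cells are added in decreasing order of density until their mass reaches $1 - \alpha$, and the result is compared with the equal tailed interval.

```python
from math import exp, lgamma, log


def beta_pdf(t, a, b):
    """Beta(a, b) density at 0 < t < 1, computed on the log scale."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(t) + (b - 1) * log(1 - t))


def grid(pdf, lo, hi, m):
    """Cell midpoints, normalised cell probabilities, and cell width on [lo, hi]."""
    width = (hi - lo) / m
    mids = [lo + (k + 0.5) * width for k in range(m)]
    dens = [pdf(t) for t in mids]
    total = sum(dens)
    return mids, [d / total for d in dens], width


def hpd_interval(pdf, lo, hi, alpha=0.05, m=20000):
    """Approximate HPD interval of a unimodal density supported on [lo, hi]."""
    mids, probs, width = grid(pdf, lo, hi, m)
    order = sorted(range(m), key=probs.__getitem__, reverse=True)
    mass, left, right = 0.0, m, -1
    for k in order:   # highest-density cells first
        mass += probs[k]
        left, right = min(left, k), max(right, k)
        if mass >= 1 - alpha:
            break
    return mids[left] - width / 2, mids[right] + width / 2


def equal_tailed_interval(pdf, lo, hi, alpha=0.05, m=20000):
    """Approximate interval cutting off mass alpha/2 in each tail."""
    mids, probs, _ = grid(pdf, lo, hi, m)
    cum, lower, upper = 0.0, None, None
    for t, p in zip(mids, probs):
        cum += p
        if lower is None and cum >= alpha / 2:
            lower = t
        if upper is None and cum >= 1 - alpha / 2:
            upper = t
    return lower, upper


posterior = lambda t: beta_pdf(t, 3.0, 9.0)   # hypothetical skewed posterior
hpd = hpd_interval(posterior, 0.0, 1.0)
eq = equal_tailed_interval(posterior, 0.0, 1.0)
print("HPD:", hpd, "length", hpd[1] - hpd[0])
print("equal tailed:", eq, "length", eq[1] - eq[0])
```

For a skewed posterior such as this one, the HPD interval is shifted toward the mode and is shorter than the equal tailed interval; for a symmetric unimodal posterior the two coincide.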


Example 2.15. Consider a normal prior for the mean of a normal population with
known variance $\sigma^2$. The posterior is normal with mean and variance given by
equations (2.2) and (2.3). The HPD interval is the same as the equal tailed
interval centered at the posterior mean,


$$C = \text{posterior mean} \pm z_{\alpha/2} \times \text{posterior s.d.}$$
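
For a normal posterior the interval of Example 2.15 takes one line to compute. As a sketch, we apply it to the $N(127, 90)$ posterior obtained in Example 2.8; the function name is ours.

```python
from math import sqrt
from statistics import NormalDist


def normal_credible_interval(post_mean, post_var, alpha=0.05):
    """Equal tailed (and HPD) interval: posterior mean +/- z_{alpha/2} * posterior s.d."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(post_var)
    return post_mean - half_width, post_mean + half_width


low, high = normal_credible_interval(127.0, 90.0)
print(f"95% credible interval: ({low:.2f}, {high:.2f})")
```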


Credible intervals are very easy to calculate unlike confidence intervals, 
the construction of which requires pivotal quantities or inversion of a family 
of tests (Chapter 1, Section 1.4.3). 

For a vector $\theta$, one may consider an HPD credible set, especially if the
posterior is unimodal. Alternatively, one may have credible intervals for each 
component. One may also report the probability of simultaneous coverage of 
all components. 


2.7.4 Testing of a Sharp Null Hypothesis Through Credible Intervals


Some Bayesians are in favor of testing, say, $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$
by accepting $H_0$ if $\theta_0$ belongs to a chosen credible set. This is similar to the
relation between confidence intervals and classical testing, except that there 
the tests are inverted to get confidence intervals. This must be thought of as 
a very informal way of testing. If one really believes that the sharp null is a 
well-formulated theory and deserves to be tested, one would surely want to 
attach a posterior probability to it. That is not possible in this approach. 

Because the inference based on credible intervals often has good frequency 
properties, a test based on them also is similar to a classical test. This is 
in sharp contrast with inference based on Bayes factors or posterior odds 
(Section 2.7.2 and Chapter 6). 


2.8 Prediction of a Future Observation 


We have already done this informally earlier. Suppose the data are $x_1, \ldots, x_n$,
where $X_1, \ldots, X_n$ are i.i.d. with density $f(x \mid \theta)$, e.g., $N(\mu, \sigma^2)$ with $\sigma^2$ known.
We want to predict the unobserved $X_{n+1}$ or set up a predictive credible
interval for $X_{n+1}$.

Prediction by a single number $t(x_1, \ldots, x_n)$ based on $x_1, \ldots, x_n$ with
squared error loss amounts to considering prediction loss


$$E\{(X_{n+1} - t)^2 \mid x\}
= E\left[ \{(X_{n+1} - E(X_{n+1} \mid x)) - (t - E(X_{n+1} \mid x))\}^2 \mid x \right]
= E\{(X_{n+1} - E(X_{n+1} \mid x))^2 \mid x\} + (t - E(X_{n+1} \mid x))^2,$$


which is minimum at


$$t = E(X_{n+1} \mid x).$$


To calculate the predictor we need to calculate the predictive distribution
$$\pi(x_{n+1} \mid x) = \int_{\Theta} \pi(x_{n+1} \mid x, \theta)\, \pi(\theta \mid x)\, d\theta
= \int_{\Theta} f(x_{n+1} \mid \theta)\, \pi(\theta \mid x)\, d\theta.$$


Let $\mu(\theta) = \int_{-\infty}^{\infty} x f(x \mid \theta)\, dx$. It can be shown that


$$E(X_{n+1} \mid x) = E(\mu(\theta) \mid x) = \int_{\Theta} \mu(\theta)\, \pi(\theta \mid x)\, d\theta$$


and hence for the normal problem the predictor is $\int_{-\infty}^{\infty} \mu\, \pi(\mu \mid x)\, d\mu$ = posterior
mean of $\mu$.

Similarly in Example 2.2, the predictive probability that the next ball is
red is

$$E(X_{n+1} \mid x) = E(p \mid x) = \frac{\alpha + r}{\alpha + \beta + n},$$
where $r = \sum_{i=1}^{n} x_i$.

A predictive credible interval for $X_{n+1}$ is $(c, d)$ where $c$ and $d$ are the $100\alpha_1\%$
and $100(1 - \alpha_2)\%$ quantiles of the predictive distribution of $X_{n+1}$ given $x$.
Usually, one takes $\alpha_1 = \alpha_2 = \alpha/2$ as for credible intervals.
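
Both predictive calculations above can be sketched in a few lines. For the normal model with known $\sigma^2$ and a normal posterior $N(m, v)$ for $\mu$, the predictive distribution of $X_{n+1}$ is $N(m, v + \sigma^2)$, a standard conjugate result that we use here without derivation. The function names are ours; the numerical inputs reuse the $N(127, 90)$ posterior and $\sigma^2 = 100$ of Example 2.8, read as predicting a repeat blood test, and the Beta-Bernoulli counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def normal_predictive(post_mean, post_var, sigma2):
    """Predictive law of X_{n+1}: posterior uncertainty plus sampling noise."""
    return NormalDist(post_mean, sqrt(post_var + sigma2))


def predictive_interval(pred, alpha=0.05):
    """Equal tailed predictive credible interval (c, d)."""
    return pred.inv_cdf(alpha / 2), pred.inv_cdf(1 - alpha / 2)


def prob_next_success(r, n, a=1.0, b=1.0):
    """Predictive probability (a + r) / (a + b + n) under a Beta(a, b) prior."""
    return (a + r) / (a + b + n)


pred = normal_predictive(127.0, 90.0, 100.0)
c, d = predictive_interval(pred)
print(f"predictor = {pred.mean:.1f}, 95% predictive interval = ({c:.1f}, {d:.1f})")
print("P(next draw is a success) =", prob_next_success(r=3, n=10))
```

The predictive interval is wider than the credible interval for $\mu$ itself, since it must also allow for the sampling variability of the new observation.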


2.9 Examples of Cox and Welch Revisited 


In both these problems (see Examples 2.5 and 2.6), the parameter is a location 
parameter. A common objective prior is 


$$\pi(\theta) = \text{constant}, \quad -\infty < \theta < \infty.$$


You can verify (Problem 11) that the objective Bayesian answers, namely, pos- 
terior variance in Cox’s example and posterior probability in Welch’s example, 
agree with the corresponding conditional frequentist answers recommended by 
Fisher. This would typically be the case for location and scale parameters. 


2.10 Elimination of Nuisance Parameters 


In problems of testing and estimation, the main object of interest may be not 
the full vector $\theta$ but one of its components. Which component is important will
depend on the context. To fix ideas let $\theta = (\theta_1, \theta_2)$ and $\theta_1$ be the parameter of
importance. The unimportant parameters $\theta_2$ are called nuisance parameters.

Classical statistics has three ways of eliminating nuisance parameters $\theta_2$
and thus simplifying the problem of inference about $\theta_1$. We explain through
three examples. 


Example 2.16. Suppose $X_1$ and $X_2$ are independent Poisson with means $\lambda_1, \lambda_2$.
You want to test $H_0 : \lambda_1 = \lambda_2$. We can reparameterize $(\lambda_1, \lambda_2)$ as $\theta_1 = \frac{\lambda_1}{\lambda_1 + \lambda_2}$,
$\theta_2 = \lambda_1 + \lambda_2$. Then $\theta_1$ is the parameter of interest. Under $H_0$, $\theta_1 = \frac{1}{2}$, only
$\theta_2$ is the unknown parameter. $T = X_1 + X_2$ is sufficient for $\lambda_1 + \lambda_2$ and the
conditional distribution of $X_1$ given $T$ is binomial$(n = T, p = 1/2)$, which can
be used to construct a conditional test.
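
The conditional test of Example 2.16 needs only binomial probabilities. A minimal sketch follows; the two-sided p-value convention (summing the probabilities of all outcomes no more likely than the observed one) is our assumption, as are the function name and the counts used.

```python
from math import comb


def conditional_poisson_test(x1: int, x2: int) -> float:
    """Two-sided p-value for H0: lambda1 = lambda2, conditioning on T = x1 + x2.

    Given T = t, X1 is binomial(t, 1/2) under H0, free of the nuisance
    parameter lambda1 + lambda2.
    """
    t = x1 + x2
    pmf = [comb(t, k) / 2 ** t for k in range(t + 1)]
    observed = pmf[x1]
    # Small relative tolerance so that ties with the observed outcome count.
    return sum(p for p in pmf if p <= observed * (1 + 1e-12))


# Hypothetical counts.
print(conditional_poisson_test(3, 12))   # unbalanced counts
print(conditional_poisson_test(5, 5))    # balanced counts
```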


Example 2.17. In the second example we use an invariance argument. Consider
a sample from $N(\mu, \sigma^2)$. We want to test $H_0 : \mu = 0$ against say $H_1 : \mu >
0$, which can be reformulated as $H_0 : \mu/\sigma = 0$ and $H_1 : \mu/\sigma > 0$. Again
reparameterize as $(\theta_1 = \mu/\sigma, \theta_2 = \sigma)$. Note that $(\bar{X}, S^2 = \frac{1}{n-1} \sum (X_i - \bar{X})^2)$
is a sufficient statistic and $\bar{X}/S$ is invariant under the transformation


$$X_i \to cX_i, \quad i = 1, \ldots, n, \quad c > 0.$$


So $\bar{X}/S = \bar{Z}/S_Z$, where $Z_i = X_i/\sigma$, depends only on $\theta_1$. The usual t-test is based
on $\bar{X}/S$.


Example 2.18. In the third method one constructs what is called a profile
likelihood for $\theta_1$ by maximizing the joint likelihood with respect to $\theta_2$ and
then using it as a sort of likelihood for $\theta_1$. Thus the profile likelihood is


$$L_p(\theta_1) = \sup_{\theta_2} f(x \mid \theta_1, \theta_2) = f(x \mid \theta_1, \hat{\theta}_2(\theta_1)),$$


where $\hat{\theta}_2(\theta_1)$ is the MLE of $\theta_2$ if $\theta_1$ is given.

In a full Bayesian approach, a nuisance parameter causes no problem. One
simply integrates it out in the joint posterior. Suppose, however, that one
does not want to do a full Bayesian analysis but rather construct a Bayesian
analogue of profile likelihood, on the basis of which some exploratory Bayesian
analysis for $\theta_1$ will be done. Once again, this is easy. One uses


$$L(\theta_1) = \int f(x \mid \theta_1, \theta_2)\, \pi(\theta_2 \mid \theta_1)\, d\theta_2.$$


We give an example to indicate that integration makes better sense than
treating the unknown $\theta_2$ as known and equal to the conditional MLE $\hat{\theta}_2(\theta_1)$.


Example 2.19. (due to Neyman and Scott). Let $X_{i1}, X_{i2}$, $i = 1, 2, \ldots, n$, be
$2n$ independent normal random variables, with $X_{i1}, X_{i2}$ being i.i.d. $N(\mu_i, \sigma^2)$.
Here $\sigma^2 = \theta_1$ is the parameter of interest and $(\mu_1, \ldots, \mu_n) = \theta_2$ is the nuisance
parameter. One may think of a weighing machine with no bias but some
variability; $\mu_i$ is the weight of the $i$th object, $X_{i1}, X_{i2}$ are two measurements of
the weight of the $i$th object. The profile likelihood is


$$L_p(\sigma^2) \propto \sup_{\mu}\, \sigma^{-2n} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \{(X_{i1} - \mu_i)^2 + (X_{i2} - \mu_i)^2\} \right)
\propto \sigma^{-2n} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \{(X_{i1} - \bar{X}_i)^2 + (X_{i2} - \bar{X}_i)^2\} \right),$$


where $\bar{X}_i = (X_{i1} + X_{i2})/2$. If one maximizes it to get an estimate of $\theta_1$, it will
be the usual MLE of $\sigma^2$, namely,


$$\frac{1}{2n} \sum_{i=1}^{n} \{(X_{i1} - \bar{X}_i)^2 + (X_{i2} - \bar{X}_i)^2\}.$$

It is easy to show (Problem 13) that the estimate is inconsistent; it
converges in probability to $\sigma^2/2$. If one corrects it for its bias by dividing by $n$,
instead of $2n$, it becomes consistent. To rectify problems with profile likelihood,
Cox and Reid (1987) have considered an asymptotic conditional likelihood,
which behaves better than profile likelihood.

The simple-minded Bayesian likelihood is


$$L(\sigma^2) = \int \sigma^{-2n} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \{(X_{i1} - \mu_i)^2 + (X_{i2} - \mu_i)^2\} \right) \pi(\mu \mid \sigma^2)\, d\mu
\propto \sigma^{-n} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \{(X_{i1} - \bar{X}_i)^2 + (X_{i2} - \bar{X}_i)^2\} \right),$$


where $\mu = (\mu_1, \ldots, \mu_n)$ has an improper uniform prior. Maximizing it one
gets a consistent estimate of $\sigma^2$.
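
The contrast between the two likelihoods in the Neyman-Scott example shows up clearly in a simulation. The sketch below is illustrative only: the value $\sigma = 2$, the range of the nuisance means, the seed, and the function name are arbitrary choices of ours. The maximizer of the profile likelihood divides the within-pair sum of squares by $2n$, while the maximizer of the integrated likelihood (proportional to $\sigma^{-n} \exp(\cdot)$) divides it by $n$.

```python
import random


def neyman_scott_estimates(n, sigma=2.0, seed=0):
    """Simulate n pairs and return (profile MLE, integrated-likelihood maximizer)."""
    rng = random.Random(seed)
    within = 0.0
    for _ in range(n):
        mu_i = rng.uniform(-50.0, 50.0)   # arbitrary nuisance mean of object i
        x1 = rng.gauss(mu_i, sigma)
        x2 = rng.gauss(mu_i, sigma)
        xbar = (x1 + x2) / 2
        within += (x1 - xbar) ** 2 + (x2 - xbar) ** 2
    return within / (2 * n), within / n


profile_mle, integrated_mle = neyman_scott_estimates(200_000)
print(f"true sigma^2 = 4, profile MLE = {profile_mle:.3f}, "
      f"integrated-likelihood estimate = {integrated_mle:.3f}")
```

With many pairs the profile estimate settles near $\sigma^2/2 = 2$ and the integrated-likelihood estimate near $\sigma^2 = 4$, whatever values the nuisance means take.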

Berger et al. (1999) discuss many such examples with subtle problems of
lack of identifiability of the parameters in the model. For example, if one has
two binomials, $B(n_i, p_i)$, $i = 1, 2$, where $n_i$ is large, $p_i$ is small, and $n_1 p_1 =
n_2 p_2 = \lambda$, then both will be well approximated by a Poisson with mean $\lambda$. So
data would provide a lot of information on $\lambda = n_i p_i$ but not so on $(n_i, p_i)$. If
both parameters of binomial, namely, $n$ and $p$ are unknown, then they may
have identifiability problems in this sense.


2.11 A High-dimensional Example 


Examples discussed so far have one thing in common — the dimension of 
the parameter space is small. We refer to them as low-dimensional. Many of 
the new problems we have to solve today have a high-dimensional parameter 
space. We refer to them as high-dimensional. One such example appears below. 


Example 2.20. New biological screening experiments, namely, microarrays, 
test simultaneously thousands of genes to identify their functions in a par- 
ticular context (say, in producing a particular protein or a particular kind of 
tumor). On the basis of the data some genes, usually in hundreds, are consid- 
ered “expressed” if they are thought to have this function. They are taken up 
for further study by more traditional and time-consuming techniques. Without 
going into the fascinating biochemistry behind these experiments, we provide 
a statistical formulation. 

The data consist of $(\bar{X}_i, S_i)$, $i = 1, 2, \ldots, p$, where $\bar{X}_i$, $S_i$ are the sample
mean and s.d. based on raw data $X_{i1}, X_{i2}, \ldots, X_{ir}$ of size $r$ on the $i$th gene.
For fixed $i$, $X_{i1}, \ldots, X_{ir}$ are i.i.d. $N(\mu_i, \sigma_i^2)$. Further, $\mu_i = 0$ if the $i$th gene is
not expressed and $\mu_i \ne 0$ if the gene is indeed expressed. Of course, we could
carry out a separate t-test for each $i$ but this ignores some additional
information that we can get by considering all the genes together in a Bayesian
way. Moreover, a simple-minded testing for each gene separately would
increase enormously the number of false discoveries. For example, if one tests
for each $i$ with $\alpha = 0.05$, then even if no genes are really expressed there
would be about $N\alpha$ false rejections of the null hypothesis of “no expression”,
$N$ being the number of genes tested. We
put a prior on the $\mu_i$'s and $\sigma_i^2$'s as follows. We assume that $(\mu_i, \sigma_i^2)$, $i = 1, 2, \ldots, p$,
are i.i.d. given certain hyper-parameters. The prior distribution for $\mu_i$, given
$\sigma_i^2$, is a mixture of two normals, $pN(0, c\sigma_i^2) + (1 - p)N(\theta, c\sigma_i^2)$, and the $\sigma_i^2$ are i.i.d.
inverse Gamma. The prior distribution has five (hyper) parameters, namely,
$p$, $c$, $\theta$ and the shape and scale parameters. If the proportion of genes expected
to be functional can be guessed, we would set $p$ to be equal to this proportion.
We would have to put a (second stage) prior on the remaining four parameters
making this an example of hierarchical priors. A somewhat simpler approach
(empirical Bayes) is to estimate the (hyper) parameters from data. We will
see in Chapter 9 that there is a lot of information about them in the data. In
either case, data about all the genes affect inference about each gene through
these (hyper) parameters that are common to all the genes. Our prior is based
on a judgment of exchangeability of $(\mu_i, \sigma_i^2)$, $i = 1, 2, \ldots, p$, and de Finetti's
Theorem in Section 2.12.
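
The remark on false discoveries in separate gene-by-gene testing can be checked with a small simulation. It relies on the standard fact that, for a continuous test statistic, the p-value is uniformly distributed on (0, 1) when the null hypothesis holds; the number of genes, the seed, and the function name are hypothetical choices of ours.

```python
import random


def count_false_rejections(num_genes, alpha=0.05, seed=1):
    """Number of rejections at level alpha when no gene is truly expressed.

    Each null p-value is simulated directly as a Uniform(0, 1) draw.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(num_genes))


num_genes = 5000
false_hits = count_false_rejections(num_genes)
print(f"{false_hits} false rejections out of {num_genes} genes "
      f"(expected about {num_genes * 0.05:.0f})")
```

Hundreds of genes would thus be flagged as expressed even though none is, which is the motivation for pooling information across genes through a common prior.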

The inference for each gene is quite simple in the PEB (parametric empir- 
ical Bayes) approach. It is more complicated in the hierarchical Bayes setup 
but doable. Both are discussed in Chapter 9. 


2.12 Exchangeability 


One may often be able to judge that a set of parameters $(\theta_1, \ldots, \theta_p)$ or a set
of observables like $(X_1, X_2, \ldots, X_n)$ are exchangeable, i.e., their joint
distribution function is left unaltered if the arguments are permuted. Thus if


$$P\{X_1 \le x_1, \ldots, X_n \le x_n\} = P\{X_1 \le x_{i_1}, \ldots, X_n \le x_{i_n}\}$$


for all $n!$ permutations $x_{i_1}, \ldots, x_{i_n}$ of $x_1, \ldots, x_n$, one says $X_1, \ldots, X_n$ are
exchangeable. A simple way of generating exchangeable random variables is
to choose an indexing random parameter $\eta$ and have the random variables
conditionally i.i.d. given $\eta$. In many cases the converse is also true, as shown
by de Finetti (1974, 1975), and Hewitt and Savage (1955). We only discuss
de Finetti's theorem.

We say $X_i$, $i = 1, 2, \ldots, n, n+1, \ldots$, is a sequence of exchangeable random
variables if $\forall n \ge 1$, $X_1, X_2, \ldots, X_n$ are exchangeable.


Theorem 2.21. (de Finetti). Suppose the $X_i$'s constitute an exchangeable
sequence and each $X_i$ takes only values 0 or 1. Then, for some $\pi$,


$$P\{X_1 = x_1, \ldots, X_n = x_n\} = \int_0^1 \eta^{\sum_{i=1}^{n} x_i} (1 - \eta)^{n - \sum_{i=1}^{n} x_i}\, d\pi(\eta),$$


$\forall n$, $\forall x_1, \ldots, x_n$ equal to 0 or 1, i.e., given $\eta$, $X_1, \ldots, X_n$ are conditionally
i.i.d. Bernoulli with parameter $\eta$ and $\eta$ has distribution $\pi$.
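
The mixture representation in de Finetti's theorem is easy to explore numerically in the easy direction, from a mixing distribution to an exchangeable sequence. As an illustration we take $\pi$ to be a Beta$(a, b)$ distribution (our choice; the theorem allows any $\pi$), for which the integral equals $B(a + s, b + n - s)/B(a, b)$ with $s = \sum x_i$. The function names and the values $a = 2$, $b = 3$ are assumptions.

```python
from itertools import permutations, product
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def joint_prob(xs, a=2.0, b=3.0):
    """P{X_1 = x_1, ..., X_n = x_n} when eta ~ Beta(a, b) and X_i | eta are i.i.d. Bernoulli(eta)."""
    n, s = len(xs), sum(xs)
    return exp(log_beta(a + s, b + n - s) - log_beta(a, b))


seq = (1, 1, 0, 0, 1)
# Exchangeability: every rearrangement of seq has the same probability.
rearranged = {round(joint_prob(p), 12) for p in permutations(seq)}
# The probabilities of all 2^5 binary sequences add up to one.
total = sum(joint_prob(xs) for xs in product((0, 1), repeat=5))
# Exchangeable but not independent: P(X1 = 1, X2 = 1) exceeds P(X1 = 1)^2.
p11, p1 = joint_prob((1, 1)), joint_prob((1,))

print(len(rearranged), round(total, 10), p11, p1 ** 2)
```

The joint probability depends on the sequence only through the number of ones, which is exactly why it is invariant under permutations; the positive dependence between $X_1$ and $X_2$ reflects learning about $\eta$.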


A Bayesian may interpret this as follows. The subjective judgment of
exchangeability leads to both a Bernoulli likelihood and the existence of a prior
$\pi$. If one has also a prediction rule as in Problem 18, $\pi$ can be specified. Thus
at least in this interpretation the prior and the likelihood have the same logical
status, both arise from a subjective judgment about observables.

Hewitt and Savage (1955) show that even if the random variables take
values in $R^p$, or more generally in a nice measurable space, then a similar
representation as conditionally i.i.d. random variables holds. See Schervish
(1995) for a statement and proof.

In many practical cases, vide Example 2.20, one may perceive certain
parameters $\theta_1, \ldots, \theta_p$ as exchangeable. Even if the parameters do not form an
infinite sequence, it is convenient to represent them as conditionally i.i.d. given
a hyperparameter. Often as in Example 2.20, the form of $\pi(\theta \mid \eta)$ is also
dictated by operational convenience. We show in Chapter 5 we can check if this
form is validated by the data.


2.13 Normative and Descriptive Aspects of Bayesian Analysis, Elicitation of Probability


Do most people faced with uncertainty make a decision as if they were 
Bayesian, each with her subjective prior and utility? The answer is generally 
No. The Bayesian approach is not claimed to be a description of how people 
tend to make a decision. On the other hand Bayesians believe, on the basis 
of various sets of rationality axioms and their consequences (as discussed in 
Chapter 3), people should act as if they have a prior and utility. The Bayesian 
approach is normative rather than descriptive. There have been empirical as 
well as philosophical studies of these issues. We refer the interested reader to 
Raiffa and Schlaifer (1961) and French and Rios Insua (2000). We explore 
tentatively a couple of issues related to this. 

It is an odd fact in our intellectual history that the concept of probability, 
which is so fundamental both in daily life and science, was developed only 
during the European Renaissance. It is tempting to speculate that our current 
inability to behave rationally under uncertainty is related to the late arrival of 
probability on the intellectual scene. Most Bayesians hope the situation will 
improve with the passage of time and attempts to educate ourselves to act 
rationally. 

Related to these facts is the inability of most people to express their
uncertainty in terms of a well calibrated probability. Probability is still most
easily calculated in gambling or similar problems where outcomes are equally
likely, in problems like life or medical insurance, where empirical calculations
based on repetitions are possible, or under exchangeability. Most examples of
successful elicitation of subjective probability involve exchangeability in some
form. However, there has been some progress in elicitation. Some of these
examples are discussed in Chapter 5.

These examples and attempts notwithstanding, full elicitation of subjec- 
tive probability is still quite rare. Most priors used in practice are at least 
partly nonsubjective. They are obtained through some objective, i.e., non- 
subjective algorithms. In some sense they are uniform distributions that take 
into account what is known, namely some prior moments or quartiles and the 
geometry in the parameter space. We discuss objective priors and Bayesian 
analysis based on them in the next section. 


2.14 Objective Priors and Objective Bayesian Analysis 


We refer to the Bayesian analysis based on objective priors as objective 
Bayesian analysis. One would expect that as elicitation improves, subjective 
Bayesian analysis would be used increasingly in place of objective Bayesian 
analysis. All Bayesians agree that wherever prior information is available, one 
should try to use a prior reflecting that as far as possible. In fact, one of 
the attractions of the Bayesian paradigm is that use of prior expert informa- 
tion is a possibility. Incorporation of prior expert opinion would strengthen 
considerably purely data based analysis in real-life decision problems as well 
as problems of statistical inference with small sample size or high or infi- 
nite dimensional parameter space. In this approach use of objective Bayesian 
analysis has no conflict with the subjectivist approach. It may also have a 
legitimate place in subjective Bayesian analysis as a reference point or origin 
with which to compare the role and importance of prior information in a par- 
ticular Bayesian decision. In a similar spirit, it may also be used to report to 
general readers or to a group of Bayesians with different priors. 

We discuss in Chapter 3 algorithms for generating common objective priors 
such as the Jeffreys or reference or probability matching priors. We also discuss 
there common criticisms, such as the fact that these priors are improper and 
depend on the experiment, as well as our answers to such criticisms. 

In examples with low-dimensional $\Theta$, objective Bayesian analysis has some 
similarities with frequentist answers, as in Examples 2.2 and 2.4, in that the 
estimate obtained or hypothesis accepted tends to be very close to what a 
frequentist would have done. However, the objective Bayesian has a poste- 
rior distribution and a data based evaluation of the error or risk associated 
with inference or decision, namely, the posterior variance or posterior error or 
posterior risk. 

In high-dimensional problems, e.g., Example 2.20, it is common to use
hierarchical priors with an objective prior of the above type used at the highest
level of the hierarchy. One then typically uses MCMC without always checking
whether the posteriors are proper; in fact, checking mathematically may be
very difficult. Truncating the prior and carefully checking the stability of
the posterior as the truncation is varied provides good numerical insight.
However, this is not the only place where an objective prior is used in the
hierarchy. In fact, in Example 2.20, the prior for $(\mu_i, \sigma^2)$ arises from a
subjective assumption of exchangeability, but the particular form taken is for
convenience. This is a non-subjective choice but, as indicated in Chapter 9,
some data based validation is possible.

The objective Bayesian analysis in high-dimensional problems is also close 
in spirit to frequentist answers to such problems. Indeed it is a pleasant fact 
that, as in low-dimensional problems but for different reasons, the frequentist 
answers are almost identical to the Bayesian answers. The frequentist answers 
are based on the parametric empirical Bayes (PEB) approach, in which the 
parameters in the last stage of hierarchical priors are estimated from data 
rather than given an objective prior. As in the low-dimensional case, the 
objective Bayesian analysis has some advantages over frequentist analysis. 
The PEB approach used by frequentists tends to underestimate the posterior 
risk. 

Though it is implicit in the above discussion, it is worth pointing out that 
Bayesian analysis can be based on an improper prior only if the posterior is 
proper. Somewhat surprisingly, the posterior is usually proper when one uses
the Jeffreys prior or a reference prior, but counter-examples exist; see Ghosh
(1997) and Section 7.4.7. 

Because the objective priors are improper, the usual type of preposterior 
analysis cannot be made at the planning stage. In particular one cannot com- 
pare different designs for an experiment and make an optimal choice. For the 
same reason choosing optimal sample size is a problem. It is suggested in 
Chapter 6 that a partial solution is to take a few observations, say the min- 
imum number of observations needed to make the posterior proper, and use 
the proper posterior as a proper prior. The additional data can be used to 
update it. For an application of these ideas, see Ghosh et al. (2002). Unfortu- 
nately, when all the data have been collected at the stage of formulating the 
prior, one would need to modify the above simple procedure. 
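
The training-sample idea described above can be illustrated in the simplest conjugate setting. The sketch below is only an illustration under assumptions that are not taken from the text: normal data with known variance, a flat prior on the mean, and hypothetical function names `normal_update` and `training_sample_posterior`. One observation already makes the posterior proper, and that posterior then serves as a proper prior for the remaining data.

```python
def normal_update(prior_mean, prior_var, x, sigma2):
    """Update a N(prior_mean, prior_var) prior on mu with one observation x ~ N(mu, sigma2)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / sigma2)
    post_mean = post_var * (prior_mean / prior_var + x / sigma2)
    return post_mean, post_var


def training_sample_posterior(data, sigma2):
    """Flat prior on mu: the first observation is the minimal training sample."""
    # With pi(mu) proportional to 1, the posterior after x1 alone is N(x1, sigma2), which is proper.
    mean, var = data[0], sigma2
    # The proper posterior now plays the role of a proper prior for the rest of the data.
    for x in data[1:]:
        mean, var = normal_update(mean, var, x, sigma2)
    return mean, var


mean, var = training_sample_posterior([1.0, 2.0, 3.0, 6.0], sigma2=4.0)
print(mean, var)  # posterior mean and variance of mu
```

In this toy case the final posterior is normal with mean equal to the sample mean and variance sigma2/n, the same as applying the flat prior to all the data at once, so splitting the sample loses nothing.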


2.15 Other Paradigms 


In earlier sections, we have discussed several aspects of the Bayesian paradigm 
and its logical advantages. In this context we have also discussed in some detail 
various problems with the classical frequentist approach. 

Some of these problems of classical statistics can be resolved, or at least
mitigated, by appropriate conditioning. Birnbaum's theorem shows that extensive
conditioning and restriction to minimal sufficiency would lead to fundamental
changes in the classical paradigm, and it may be quite awkward to find a
suitable conditioning. Even so, the idea of conditioning makes it possible to
reconcile a lot of objective Bayesian analysis with classical statistics. At
least this makes communication relatively easy between the paradigms.

There have also been attempts to create a new paradigm of inference based 
on sufficiency, conditioning and likelihood. An excellent treatment is available
in Sprott (2000). Some of our reservations are listed in Ghosh (2002).

One should also mention belief functions and upper and lower probabilities
of Dempster and Shafer (see Dempster (1967), Shafer (1976) and Shafer
(1987)). Wasserman and Kadane (1990) have shown that under certain ax- 
ioms, their approach may be identified with a robust Bayesian point of view. 
Problems of foundations of probability and inference remain an active area. 

An entirely different popular approach is data analysis. Data analysis
makes few assumptions and is very innovative, yet easy to communicate.
However, it is rather ad hoc and cannot quite be called a paradigm. If machine
or statistical learning emerges as a new alternative paradigm for learning from 
data, then data analysis would find in it the paradigm it currently lacks. 


2.16 Remarks 


Even though there are several different paradigms, we believe the Bayesian 
approach is not only the most logical but also very flexible and easy to communicate.
Many innovations in computation have led to wide applicability
as well as wide acceptance from not only statisticians but other scientists. 
Within the Bayesian paradigm it is relatively easy to use information as well 
as solve real-life decision problems. Also, we can now construct our priors, 
with a fair amount of confidence as to what they represent, to what extent 
they use subjective prior information and to what extent they are part of an 
algorithm to produce a posterior. 

The fact that there are no paradoxes or counter-examples suggests the 
logical foundations are secure in spite of a rapid, vigorous growth, especially in
the past two decades. The advantage of a strong logical foundation is that it 
makes the subject a discipline rather than a collection of methods, like data 
analysis. It also allows new problems to be approached systematically and 
therefore with relative ease. 

Though based on subjective ideas, the paradigm accepts likelihood and
frequentist validation in the real world, as well as the consequent calibration
of probabilities, utilities, and likelihood based methods.

In other words, it seems to combine many of the conceptual and method- 
ological strengths of both classical statistics and data analysis, but is free from 
the logical weaknesses of both. 

Ultimately, each reader has to make up her own mind, but hopefully even
a reader not completely convinced of the superiority of Bayesian analysis will
learn much that would be useful to her in the paradigm of her choice. This
book is offered in a spirit of reconciliation and exploration of paradigms, even 
though from a Bayesian point of view. In many ways current mutual interac- 
tion between the three paradigms is reminiscent of the periods of similar rapid 
growth in the eighteenth, nineteenth, and early twentieth centuries. We have 
in mind especially the history of least squares, which began as a data analytic
tool, then got itself a probabilistic model in the hands of Gauss, Laplace, and 
others. The associated inferential probabilities were simultaneously subjective 
and frequentist. The interested reader may also want to browse through von 
Mises (1957). 


2.17 Exercises 


1. (a) (French (1986)) Three prisoners, A, B, and C, are each held in solitary
confinement. A knows that two of them will be hanged and one will be set
free but he does not know who will go free. Therefore, he reasons that he
has a 1/3 chance of survival. He asks the guard who will go free, but has no
success there. Being an intelligent person, he comes up with the following
question for the guard:

If two of us must die, then I know that either B or C must die and
possibly both. Therefore, if you tell me the name of one who is to die,
I learn nothing about my own fate; further, because we are kept apart, I
cannot reveal it to them. So tell me the name of one of them who is to
die.

The guard likes this logic and tells A that C will be hanged. A now argues
that either he or B will go free, and so now he has a 1/2 chance of survival.
Is this reasoning correct?

(b) There are three chambers, one of which has a prize. The master of ceremonies
will give the prize to you if you guess the right chamber correctly.
You first make a random guess. Then he shows you one chamber which is 
empty. You have an option to stick to your original guess or switch to the 
remaining other chamber. (The chamber you guessed first has not been 
opened). What should you do? 

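2. Suppose $X \mid \mu \sim N(\mu, \sigma^2)$, $\sigma^2$ known, and $\mu \sim N(\eta, \tau^2)$, $\eta$ and $\tau^2$ known.
(a) Show that the joint density $g(x, \mu)$ of $X$ and $\mu$ can be written as

$$
\begin{aligned}
g(x, \mu) &= \pi(\mu) f(x \mid \mu)
 = \frac{1}{2\pi\sigma\tau} \exp\left\{ -\frac{1}{2} \left[ \frac{(\mu - \eta)^2}{\tau^2} + \frac{(x - \mu)^2}{\sigma^2} \right] \right\} \\
&= \frac{1}{\sqrt{2\pi(\tau^2 + \sigma^2)}} \exp\left\{ -\frac{(x - \eta)^2}{2(\tau^2 + \sigma^2)} \right\} \\
&\quad \times \sqrt{\frac{\tau^2 + \sigma^2}{2\pi\tau^2\sigma^2}} \exp\left\{ -\frac{\tau^2 + \sigma^2}{2\tau^2\sigma^2} \left( \mu - \frac{\tau^2\sigma^2}{\tau^2 + \sigma^2} \left( \frac{\eta}{\tau^2} + \frac{x}{\sigma^2} \right) \right)^2 \right\}.
\end{aligned}
$$

(b) From (a) show that the marginal density $m(x)$ of $X$ is

$$
m(x) = \frac{1}{\sqrt{2\pi(\tau^2 + \sigma^2)}} \exp\left\{ -\frac{(x - \eta)^2}{2(\tau^2 + \sigma^2)} \right\}
$$

and the posterior density $\pi(\mu \mid x)$ of $\mu \mid X = x$ is

$$
\pi(\mu \mid x) = \sqrt{\frac{\tau^2 + \sigma^2}{2\pi\tau^2\sigma^2}} \exp\left\{ -\frac{\tau^2 + \sigma^2}{2\tau^2\sigma^2} \left( \mu - \frac{\tau^2\sigma^2}{\tau^2 + \sigma^2} \left( \frac{\eta}{\tau^2} + \frac{x}{\sigma^2} \right) \right)^2 \right\}.
$$

(c) What are the posterior mean and posterior s.d. of $\mu$ given $X = x$?
(d) Instead of a single observation $X$ as above, consider a random sample
$X_1, \ldots, X_n$. What is the minimal sufficient statistic and what is the
likelihood function for $\mu$ now? Work out (b) and (c) in this case.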

3. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$, $\sigma^2$ known. Consider testing

$$H_0: \mu \le \mu_0 \quad \text{versus} \quad H_1: \mu > \mu_0.$$

(a) Compute the P-value. Compare it with the posterior probability of $H_0$
when $\mu$ is assumed to have the uniform prior.

(b) Do the same for a sharp $H_0$.

4. Refer to Welch's problem, Example 2.6. Follow Fisher's suggestion and
calculate $P\{\text{CI covers } \theta \mid X_1 - X_2\}$ and verify it agrees with the objective
Bayes solution with improper uniform prior for $\theta$.

5. (Berger's version of Welch's problem, see Berger (1985b)). Suppose $X_1$
and $X_2$ are i.i.d. having the discrete distribution

$$
X_i = \begin{cases} \theta - 1/2 & \text{with probability } 1/2; \\ \theta + 1/2 & \text{with probability } 1/2, \end{cases}
$$

where $\theta$ is an unknown real number.
(a) Show that the set $C$ given by

$$
C = \begin{cases} \{(X_1 + X_2)/2\} & \text{if } X_1 \ne X_2; \\ \{X_1 - 1/2\} & \text{if } X_1 = X_2, \end{cases}
$$

is a 75% confidence set for $\theta$.
(b) Calculate $P\{C \text{ covers } \theta \mid X_1 - X_2\}$.


6. Can the Welch paradox occur if $X_1, X_2$ are i.i.d. $N(\theta, 1)$?

7. (Newton versus Einstein). In Example 2.12 calculate the Bayes factor,
$BF_{01}$, for the given data using Jeffreys prior.


8. Let $X_1, \ldots, X_n$ be i.i.d. Bernoulli($p$) (i.e., $B(1, p)$).

(a) Assume $p$ has Jeffreys prior. Construct the $100(1 - \alpha)\%$ HPD credible
interval for $p$.

(b) Suppose $n = 10$ and $\alpha = 0.05$. Calculate the frequentist coverage
probability of the interval in (a) using simulation.


9. Consider the same model as in Problem 8. Derive the minimax estimate of
$p$ under the squared error loss. Plot and compare the mean squared error of
this estimate with that of $\bar{X}$ for $n$ = 10, 50, 100, and 400. (The minimax
estimate seems to do better at least up to $n = 100$.)

10. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$, $\sigma^2$ known. Suppose $\mu$ has the
$N(\eta, \tau^2)$ prior distribution with known $\eta$ and $\tau^2$.

(a) Construct the $100(1 - \alpha)\%$ HPD credible interval for $\mu$.

(b) Construct a $100(1 - \alpha)\%$ predictive interval for $X_{n+1}$.

(c) Consider the uniform prior for this problem by letting $\tau^2 \to \infty$. Work
out (a) and (b) in this case.

11. (a) Refer to Example 2.6. Let $C(X_1, X_2)$ denote the $100(1 - \alpha)\%$ confidence
interval for $\theta$. Assume that $\theta$ has Jeffreys prior. Then show that

$$P\{C(X_1, X_2) \text{ covers } \theta \mid X_1 - X_2\} = P\{\theta \in C(X_1, X_2) \mid X_1, X_2\}.$$

(b) Recall Example 2.5. Assume that $\mu$ has Jeffreys prior. Then show that
$\mathrm{Var}(\mu \mid \boldsymbol{x}) = \mathrm{Var}(\bar{X} \mid n)$.

12. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$, $\sigma^2$ unknown. Consider Jeffreys test
(Section 2.7.2) for testing $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$. Consider
both the normal and Cauchy priors for $\mu \mid \sigma^2$ under $H_1$. Suppose $\bar{X} \to \infty$
and $s^2$ is bounded. Compute $BF_{01}$ under both the priors and show that
$BF_{01}$ converges to zero for the Cauchy prior but does not converge to zero for
the normal prior.


13. (a) Refer to Example 2.19. Show that the usual MLE of $\sigma^2$, namely,

$$
\hat{\sigma}^2 = \frac{1}{2n} \sum_{i=1}^{n} \left\{ (X_{i1} - \bar{X}_i)^2 + (X_{i2} - \bar{X}_i)^2 \right\},
$$

is inconsistent. Correct it to get a consistent estimate.

(b) Suppose $X_1, \ldots, X_k$ are i.i.d. $B(n, p)$. Both $n$ and $p$ are unknown,
but only $n$ is of interest, so $p$ is a nuisance parameter (see Berger et al.
(1999)). Derive the following likelihoods for $n$: (i) profile likelihood, (ii)
conditional likelihood, i.e., that obtained from the conditional distribution
of the $X_i$'s given their sum (and $n$), (iii) integrated likelihood with respect to
the uniform prior, and (iv) integrated likelihood with respect to Jeffreys
prior.

(c) Suppose the observations are (17, 19, 21, 28, 30). Plot and compare
the different likelihoods in (b) above, and comment.

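14. Suppose $X \mid \mu \sim N_p(\mu, \Sigma)$, $\Sigma$ known, and $\mu \sim N_p(\eta, \Gamma)$, $\eta$ and $\Gamma$ known.
(a) Show that the above probability structure is equivalent to $X = \mu + \epsilon$,
$\epsilon \sim N_p(0, \Sigma)$, $\mu \sim N_p(\eta, \Gamma)$, $\epsilon$ and $\mu$ are independent and $\Sigma$, $\eta$, $\Gamma$ are
known.

(b) From (a) show that the joint distribution of $X$ and $\mu$ is

$$
\begin{pmatrix} X \\ \mu \end{pmatrix} \sim N_{2p}\left( \begin{pmatrix} \eta \\ \eta \end{pmatrix}, \begin{pmatrix} \Sigma + \Gamma & \Gamma \\ \Gamma & \Gamma \end{pmatrix} \right).
$$

(c) From (b) and using multivariate normal theory, show that

$$
\mu \mid X = x \sim N_p\left( \Gamma(\Sigma + \Gamma)^{-1} x + \Sigma(\Sigma + \Gamma)^{-1} \eta,\; \Gamma - \Gamma(\Sigma + \Gamma)^{-1}\Gamma \right).
$$

(d) What are the posterior mean and posterior dispersion matrix of $\mu$?
Construct a $100(1 - \alpha)\%$ HPD credible set for $\mu$.

(e) Work out (d) with a uniform prior.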

15. Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independent random samples, respectively,
from $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$, where $\sigma^2$ is known. Construct a
$100(1 - \alpha)\%$ credible interval for $(\mu_1 - \mu_2)$ assuming a uniform prior on
$(\mu_1, \mu_2)$.

16. Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independent random samples, respectively,
from $N(\mu, \sigma_1^2)$ and $N(\mu, \sigma_2^2)$, where both $\sigma_1^2$ and $\sigma_2^2$ are known.
Construct a $100(1 - \alpha)\%$ credible interval for the common mean $\mu$ assuming
a uniform prior. Show that the frequentist $100(1 - \alpha)\%$ confidence
interval leads to the same answer.

17. (Behrens-Fisher problem) Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be independent
random samples, respectively, from $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, where all
the four parameters are unknown, but inference on $\mu_1 - \mu_2$ is of interest.
To derive a confidence interval for $\mu_1 - \mu_2$ and also test $H_0: \mu_1 = \mu_2$, the
Behrens-Fisher solution is to use the statistic

$$
T = \frac{\bar{X} - \bar{Y}}{\sqrt{s_1^2/m + s_2^2/n}},
$$

where $s_1^2 = \sum_{i=1}^{m} (X_i - \bar{X})^2/(m - 1)$ and $s_2^2 = \sum_{j=1}^{n} (Y_j - \bar{Y})^2/(n - 1)$.
(a) Show that

$$
\frac{s_1^2/m + s_2^2/n}{\sigma_1^2/m + \sigma_2^2/n} \sim \frac{\chi^2_\nu}{\nu}
$$

approximately, where $\nu$ can be estimated by

$$
\hat{\nu} = \frac{(s_1^2/m + s_2^2/n)^2}{s_1^4/(m^2(m - 1)) + s_2^4/(n^2(n - 1))}.
$$

(Hint: If we want to approximate the weighted sum $\sum_i a_i V_i$ of independent
$\chi^2$ random variables $V_i$ by a $\chi^2_\nu/\nu$, then a method of moments estimate
for $\nu$ is available; see Satterthwaite (1946) and Welch (1949).)

(b) Using (a), justify that $T$ is approximately distributed like a Student's
$t$ with $\nu$ degrees of freedom under $H_0$.

(c) Show numerically that the $100(1 - \alpha)\%$ confidence interval for $\mu_1 - \mu_2$
derived using $T$ is conservative, i.e., its confidence coefficient will always be
$\ge 1 - \alpha$. (See Robinson (1976). A Bayesian solution to the Behrens-Fisher
problem is discussed in Chapter 8.)

18. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. Bernoulli($p$) and the prediction loss is
squared error. Further, suppose that for all $n \ge 1$, the Bayes prediction
rule is given by

$$
E(X_{n+1} \mid X_1, \ldots, X_n) = \frac{\alpha + \sum_{i=1}^{n} X_i}{\alpha + \beta + n},
$$

for some $\alpha > 0$ and $\beta > 0$. Show that this is possible iff the prior on $p$ is
Beta($\alpha$, $\beta$).


19. Suppose $(N_1, \ldots, N_k)$ have the multinomial distribution with density

$$
f(n_1, \ldots, n_k \mid p) = \frac{n!}{\prod_{i=1}^{k} n_i!} \prod_{i=1}^{k} p_i^{n_i}.
$$

Let $p$ have the Dirichlet prior with density

$$
\pi(p \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} p_i^{\alpha_i - 1}.
$$

(a) Find the posterior distribution of $p$.
(b) Find the posterior mean vector and the dispersion matrix of $p$.
(c) Construct a $100(1 - \alpha)\%$ HPD credible interval for $p_1$ and also for
$p_1 + p_2$.

20. Let $p$ denote the probability of success with a particular drug for some
disease. Consider two different experiments to estimate $p$. In the first
experiment, $n$ randomly chosen patients are treated with this drug and
let $X$ denote the number of successes. In the other experiment patients are
treated with this drug, one after the other until $r$ successes are observed.
In this experiment, let $Y$ denote the total number of patients treated with
this drug.

(a) Construct $100(1 - \alpha)\%$ HPD credible intervals for $p$ under $U(0, 1)$ and
Jeffreys prior when $X = x$ is observed.

(b) Construct $100(1 - \alpha)\%$ HPD credible intervals for $p$ under $U(0, 1)$ and
Jeffreys prior when $Y = y$ is observed.

(c) Suppose $n = 16$, $x = 6$, $r = 6$, and $y = 16$. Now compare (a) and (b)
and comment with reference to LP.

21. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$, $\sigma^2$ known. Suppose we want to test
$H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$. Let $\pi_0 = P(H_0) = 1/2$ and under $H_1$, let
$\mu \sim N(\mu_0, \tau^2)$. Show that, unlike in the case of a one-sided alternative,
the P-value and the posterior probability of $H_0$ can be drastically different
here.

22. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$, where both $\mu$ and $\sigma^2$ are unknown. Take
the prior $\pi(\mu, \sigma^2) \propto \sigma^{-2}$. Consider testing

$$H_0: \mu \le \mu_0 \quad \text{versus} \quad H_1: \mu > \mu_0.$$

Compute the P-value. Compare it with the posterior probability of $H_0$.
Compute the Bayes factor $BF_{01}$.


Utility, Prior, and Bayesian Robustness


We begin this chapter with a discussion of rationality axioms for preference 
and how one may deduce the existence of a utility and prior. Later we explore 
how robust or sensitive is Bayesian analysis to the choice of prior, utility, 
and model. In the process, we introduce and examine various quantitative 
evaluations of robustness. 


3.1 Utility, Prior, and Rational Preference 


We have introduced in Chapter 2 problems of estimation and testing as 
Bayesian decision problems. We recall the components of a general Bayesian 
decision problem. 

Let $\mathcal{X}$ be the sample space, $\Theta$ the parameter space, $f(x \mid \theta)$ the density of $X$
and $\pi(\theta)$ the prior probability density. Moreover, there is a space $\mathcal{A}$ of actions "a"
and a loss function $L(\theta, a)$. The decision maker (DM) chooses "a" to minimize
the posterior risk

$$\psi(a \mid x) = \int_\Theta L(\theta, a)\, \pi(\theta \mid x)\, d\theta, \qquad (3.1)$$

where $\pi(\theta \mid x)$ is the posterior density of $\theta$ given $x$. Note that given the loss
function and the prior, there is a natural preference ordering $a_1 \preceq a_2$ (i.e., $a_2$
is at least as good as $a_1$) iff $\psi(a_2 \mid x) \le \psi(a_1 \mid x)$.
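
The minimization of the posterior risk (3.1) can be sketched numerically on a finite parameter set. This is a minimal illustration, not the book's own code: the grid of parameter values, the Bernoulli data, and the helper names `posterior`, `posterior_risk`, and `bayes_action` are all assumptions made for the example, and the integral in (3.1) becomes a sum.

```python
def posterior(prior, likelihood):
    """Bayes' theorem on a finite parameter set.

    prior: dict theta -> pi(theta); likelihood: dict theta -> f(x | theta) at the observed x.
    """
    joint = {t: prior[t] * likelihood[t] for t in prior}
    m = sum(joint.values())  # marginal density of x
    return {t: j / m for t, j in joint.items()}


def posterior_risk(action, post, loss):
    """psi(a | x) = sum over theta of L(theta, a) * pi(theta | x)."""
    return sum(loss(t, action) * p for t, p in post.items())


def bayes_action(actions, post, loss):
    """Action with the smallest posterior risk."""
    return min(actions, key=lambda a: posterior_risk(a, post, loss))


thetas = [0.2, 0.5, 0.8]
prior = {t: 1.0 / 3.0 for t in thetas}
# Hypothetical data: 3 successes in 4 Bernoulli(theta) trials.
likelihood = {t: t**3 * (1 - t) for t in thetas}
post = posterior(prior, likelihood)
squared_error = lambda theta, a: (theta - a) ** 2
best = bayes_action(thetas, post, squared_error)
print(best)
```

Here the action set is restricted to the same three grid points, so the Bayes action is the grid point with the smallest posterior expected squared error rather than the posterior mean itself.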

There is a long tradition of foundational study dating back to Ramsey
(1926), in which one starts with such a preference relation on $\mathcal{A} \times \mathcal{A}$ satisfying
certain rational axioms (i.e., axioms modeling rational behavior) like
transitivity. It can then be shown that such a relation can only be induced as
above via a loss function and a prior, i.e., $\exists\, L$ and $\pi$ such that

$$a_1 \preceq a_2 \iff \int L(\theta, a_2)\, \pi(\theta)\, d\theta \le \int L(\theta, a_1)\, \pi(\theta)\, d\theta. \qquad (3.2)$$

In other words, from an objectively verifiable rational preference relation, one
can recover the subjective loss function and prior. If there is no sample data,
then $\pi$ would qualify as a subjective prior for the DM. If we have data $x$, a
likelihood function $f(x \mid \theta)$ is given and we are examining a preference relation
given $x$, then also one can deduce the existence of $L$ and $\pi$ such that

$$a_1 \preceq a_2 \iff \int L(\theta, a_2)\, \pi(\theta \mid x)\, d\theta \le \int L(\theta, a_1)\, \pi(\theta \mid x)\, d\theta \qquad (3.3)$$

under appropriate axioms.

In Section 3.2, we explore the elicitation or construction of a loss function 
given certain rational preference relations. In the next couple of sections, we 
discuss a result that shows we must have a (subjective) prior if our preference 
among actions satisfies certain axioms about rational behavior. Together, they 
justify (3.2) and throw some light on (3.3). In the remaining sections we 
examine different aspects of sensitivity of Bayesian analysis with respect to the 
prior. Suppose one thinks of the prior as only an approximate quantification 
of prior belief. In principle, one would have a whole family of such priors, all 
approximately quantifying one’s prior belief. How much would the Bayesian 
analysis change as the prior varies over this class? This is a basic question in 
the study of Bayesian robustness. 

It turns out that there are some preference relations weaker than those of
Section 3.3 that lead to a situation like what was mentioned above, i.e., one
can show the existence of a class of priors such that

$$a_1 \preceq a_2 \iff \int L(\theta, a_2)\, \pi(\theta)\, d\theta \le \int L(\theta, a_1)\, \pi(\theta)\, d\theta \qquad (3.4)$$

for all $\pi$ in the given class. This preference relation is only a partial ordering,
i.e., not all pairs $a_1, a_2$ can be ordered.
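
The partial ordering in (3.4) can be made concrete with a small finite example. The sketch below is illustrative only; the two-state problem, the 0-1 loss table, the two classes of priors, and the function names `expected_loss` and `compare` are assumptions introduced here, not part of the text.

```python
def expected_loss(action, prior, loss):
    """Prior expected loss: sum over theta of L(theta, a) * pi(theta)."""
    return sum(loss[(t, action)] * p for t, p in prior.items())


def compare(a1, a2, prior_class, loss):
    """Order a1 and a2 as in (3.4): the inequality must hold for every prior in the class."""
    a2_ok = all(expected_loss(a2, pi, loss) <= expected_loss(a1, pi, loss) for pi in prior_class)
    a1_ok = all(expected_loss(a1, pi, loss) <= expected_loss(a2, pi, loss) for pi in prior_class)
    if a1_ok and a2_ok:
        return "equivalent"
    if a2_ok:
        return "a2 at least as good as a1"
    if a1_ok:
        return "a1 at least as good as a2"
    return "not comparable"


# Two states and two actions with 0-1 loss: action "a_k" is correct under state k.
loss = {(1, "a1"): 0, (2, "a1"): 1, (1, "a2"): 1, (2, "a2"): 0}
narrow_class = [{1: 0.4, 2: 0.6}, {1: 0.2, 2: 0.8}]
wide_class = [{1: 0.4, 2: 0.6}, {1: 0.6, 2: 0.4}]
print(compare("a1", "a2", narrow_class, loss))
print(compare("a1", "a2", wide_class, loss))
```

With the narrow class every prior favors state 2, so the two actions can be ordered; with the wide class the priors disagree and the pair is left unordered, which is what makes the relation only partial.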

The Bayes rule $a(x)$ minimizing $\psi(a \mid x)$ also minimizes the integrated risk
of decision rules $\delta(x)$,

$$r(\pi, \delta) = \int_\Theta R(\theta, \delta)\, \pi(\theta)\, d\theta,$$

where $R(\theta, \delta)$ is the risk of $\delta$ under $\theta$, namely, $\int_{\mathcal{X}} L(\theta, \delta(x))\, f(x \mid \theta)\, dx$. Given
a pair of decision rules, we can define a preference relation

$$a_1 \preceq a_2 \iff r(\pi, a_2(\cdot)) \le r(\pi, a_1(\cdot)). \qquad (3.5)$$

One can examine a converse of (3.5) in the same way as we did with (3.2)
through (3.4). One can start with a preference relation that orders decision
rules (rather than actions) and look for rationality axioms which would guarantee
existence of $L$ and $\pi$. For (3.2), (3.3) and (3.5) a good reference is
Schervish (1995) or Ferguson (1967). Classic references are Savage (1954) and 
DeGroot (1970); other references are given later. For (3.4) a good reference is 
Kadane et al. (1999). 

A similar but different approach to subjective probability is via coherence,
due to de Finetti (1972). We take this up in Section 3.4.


3.2 Utility and Loss


It is tempting to think of the loss function $L(\theta, a)$ and a utility function
$u(\theta, a) = -L(\theta, a)$ as conceptually a mirror image of each other. French and
Ríos Insua (2000) point out that there can be important differences that
depend on the context.

In most statistical problems the DM (decision maker) is really trying to
learn from data rather than implement a decision in the real world that has
monetary consequences. For convenience we refer to these as decision problems
of Type 1 and Type 2. In Type 1 problems, i.e., problems without monetary
consequences (see Examples 2.1-2.3), for each $\theta$ there is usually a correct
decision $a(\theta)$ that depends on $\theta$, and $L(\theta, a)$ is a measure of how far "a" is
away from $a(\theta)$ or a penalty for deviation from $a(\theta)$. In a problem of estimating
$\theta$, the correct decision $a(\theta)$ is $\theta$ itself. Common losses are $(a - \theta)^2$, $|a - \theta|$,
etc. In the problem of estimating $\tau(\theta)$, $a(\theta)$ equals $\tau(\theta)$ and common losses
are $(\tau(\theta) - a)^2$, $|\tau(\theta) - a|$, etc. In testing a null hypothesis $H_0$ against $H_1$,
the 0-1 loss assigns no penalty for a correct decision and a unit penalty for
an incorrect decision. In Type 2 problems, there is a similarity with gambles
where one must evaluate the consequence of a risky decision. Historically, in 
such contexts one talks of utility rather than loss, even though either could 
be used. We consider below an axiomatic approach to existence of a utility for 
Type 2 problems but we use the notations for a statistical decision problem 
by way of illustration. We follow Ferguson (1967) here as well as in the next 
section. 

Let $\mathcal{P}$ denote the space of all consequences like $(\theta, a)$. It is customary to
regard them as non-numerical pay-offs. Let $\mathcal{P}^*$ be the space of all probability
distributions on $\mathcal{P}$ that put mass on a finite number of points. The set $\mathcal{P}^*$
represents risky decisions with uncertainty quantified by a known element of
$\mathcal{P}^*$. Suppose the DM has a preference relation on $\mathcal{P}^*$, namely a total order,
i.e., given any pair $p_1, p_2 \in \mathcal{P}^*$, either $p_1 \preceq p_2$ ($p_2$ is preferred) or $p_2 \preceq p_1$
($p_1$ is preferred) or both. Suppose also the preference relation is transitive,
i.e., if $p_1 \preceq p_2$ and $p_2 \preceq p_3$, then $p_1 \preceq p_3$. We refer the reader to French and
Ríos Insua (2000) for a discussion of how compelling these conditions are. It is
clear that one can embed $\mathcal{P}$ as a subset of $\mathcal{P}^*$ by treating each element of $\mathcal{P}$ as
a degenerate element of $\mathcal{P}^*$. Thus the preference relation is also well-defined
on $\mathcal{P}$. Suppose the relation satisfies axioms $H_1$ and $H_2$.

$H_1$ If $p_1$, $p_2$ and $q \in \mathcal{P}^*$ and $0 < \lambda < 1$, then $p_1 \preceq p_2$ if and only if
$\lambda p_1 + (1 - \lambda) q \preceq \lambda p_2 + (1 - \lambda) q$.

$H_2$ If $p_1, p_2, p_3$ are in $\mathcal{P}^*$ and $p_1 \prec p_2 \prec p_3$, then there exist numbers
$0 < \lambda < 1$, $0 < \mu < 1$, such that

$$\lambda p_3 + (1 - \lambda) p_1 \prec p_2 \prec \mu p_3 + (1 - \mu) p_1.$$

Ferguson (1967) shows that if $H_1$ and $H_2$ hold then there exists a utility
$u(\cdot)$ on $\mathcal{P}^*$ such that $p_1 \preceq p_2$ if and only if $u(p_1) \le u(p_2)$, where for
$p^* = \sum_i \lambda_i p_i$ with $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$, $u(p^*)$ is defined to be the average

$$u(p^*) = \sum_i \lambda_i u(p_i). \qquad (3.6)$$


The main idea of the proof, which may also be used for eliciting the utility,
is to start with a pair $p_1^* \prec p_2^*$, i.e., $p_1^* \preceq p_2^*$ but $p_1^* \not\sim p_2^*$. (Here $\sim$ denotes the
equivalence relation that the DM is indifferent between the two elements.)
Consider all $p_1^* \preceq p^* \preceq p_2^*$. Then by the assumptions $H_1$ and $H_2$, one can
find $0 \le \lambda^* \le 1$ such that the DM would be indifferent between $p^*$ and
$(1 - \lambda^*) p_1^* + \lambda^* p_2^*$. One can write $\lambda^* = u(p^*)$ and verify that
$p_1^* \preceq p_3^* \preceq p_4^* \preceq p_2^*$ iff $\lambda_3^* = u(p_3^*) \le u(p_4^*) = \lambda_4^*$, as well as the
relation (3.6) above. For $p_2^* \prec p^*$, by a similar argument one can find a
$0 < \mu^* < 1$ such that

$$p_2^* \sim (1 - \mu^*) p_1^* + \mu^* p^*,$$

from which one gets

$$p^* \sim (1 - \lambda^*) p_1^* + \lambda^* p_2^*,$$

where $\lambda^* = 1/\mu^*$. Set $\lambda^* = u(p^*)$ as before. In a similar way, one can find a
$\lambda^*$ for $p^* \prec p_1^*$ and set $u(p^*) = \lambda^*$.

In principle, $\lambda^*$ can be elicited for each $p^*$. Incidentally, utility is not
unique. It is unique up to a change in origin and scale. Our version is chosen
so that $u(p_1^*) = 0$, $u(p_2^*) = 1$.
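
The normalization of the utility and the averaging rule (3.6) can be illustrated with a few lines of code. This is a sketch under assumed inputs: the three consequences, their numerical utilities, and the helper names `normalized_utility` and `utility_of_gamble` are hypothetical and chosen only to show that a change of origin and scale leaves the preference ordering intact.

```python
def normalized_utility(u, worst, best):
    """Rescale a utility so that u(worst) = 0 and u(best) = 1 (a change of origin and scale)."""
    lo, hi = u[worst], u[best]
    return {c: (val - lo) / (hi - lo) for c, val in u.items()}


def utility_of_gamble(gamble, u):
    """Relation (3.6): the utility of a finite mixture is the weighted average of utilities."""
    return sum(weight * u[c] for c, weight in gamble.items())


# Hypothetical utilities on three consequences.
u = {"poor": -10.0, "fair": 5.0, "good": 20.0}
v = normalized_utility(u, worst="poor", best="good")

sure_fair = {"fair": 1.0}
risky = {"poor": 0.4, "good": 0.6}
print(v["fair"])  # the indifference weight lambda* for "fair"
print(utility_of_gamble(sure_fair, u) < utility_of_gamble(risky, u))
print(utility_of_gamble(sure_fair, v) < utility_of_gamble(risky, v))
```

On the normalized scale the utility of "fair" is exactly the mixing weight at which the DM is indifferent between "fair" for sure and a gamble between "poor" and "good", which is how the elicitation argument reads off the utility.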

French and Ríos Insua (2000) point out that most axiomatic approaches
to the existence of a utility first exhibit a utility on $\mathcal{P}^*$ and then restrict it
to $\mathcal{P}$, whereas intuitively, one would want to define $u(\cdot)$ first on $\mathcal{P}$ and then
extend it to $\mathcal{P}^*$. They discuss how this can be done.


3.3 Rationality Axioms Leading to the Bayesian Approach


(Section 3.3 may be omitted at first reading.)

Consider a decision problem with all the ingredients discussed in Section 3.1
except the prior. If the sample space and the action space are finite, then the
number of decision functions (i.e., functions from $\mathcal{X}$ to $\mathcal{A}$) is finite. In this
case, the decision maker (DM) may be able to order any pair of given decision
functions according to her rational preference of one to the other, taking into
account consequences of actions and all inherent uncertainties. Consider a
randomized decision rule defined by

$$\delta = p_1 \delta_1 + p_2 \delta_2 + \cdots + p_k \delta_k,$$

where $\delta_1, \delta_2, \ldots, \delta_k$ constitute a listing of all the non-randomized decision
functions and $(p_1, p_2, \ldots, p_k)$ is a probability vector, i.e., $p_i \ge 0$ and
$\sum_{i=1}^{k} p_i = 1$. The representation means that for each $x$, the probability
distribution $\delta(x)$ in the action space is the same as choosing the action $\delta_i(x)$
with probability $p_i$. Suppose the DM can order any pair of randomized decision
functions also in a rational way and it reduces to her earlier ordering if the
randomized decision functions being compared are in fact non-randomized with
one $p_i$ equal to 1 and other $p_i$'s equal to zero. Under certain axioms that we
explore below, there exists a prior $\pi(\theta)$ such that $\delta_1^* \succeq \delta_2^*$, i.e., the DM
prefers $\delta_1^*$ to $\delta_2^*$, if and only if

$$
r(\pi, \delta_1^*) = \sum_{\theta, a} \pi(\theta) P_\theta(a \mid \delta_1^*) L(\theta, a) \le \sum_{\theta, a} \pi(\theta) P_\theta(a \mid \delta_2^*) L(\theta, a) = r(\pi, \delta_2^*),
$$

where $P_\theta(a \mid \delta^*)$ is the probability of choosing the action "a" when $\theta$ is the value
of the parameter and $\delta^*$ is used, i.e., using the representation $\delta^* = \sum_i p_i^* \delta_i$,

$$P_\theta(a \mid \delta^*) = \sum_x f(x \mid \theta) \sum_i p_i^* I_i(x),$$

and $I_i$ is the indicator function

$$
I_i(x) = \begin{cases} 1 & \text{if } \delta_i(x) = a; \\ 0 & \text{if } \delta_i(x) \ne a. \end{cases}
$$


We need to work a little to move from here to the starting point of Ferguson 
(1967). 
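
For a finite problem, the action probabilities and the integrated risk of a randomized rule defined above can be computed by direct enumeration. The following is a sketch only: the two-point sample space, the model probabilities, the 0-1 loss, and the function names `all_rules`, `action_prob`, and `integrated_risk` are assumptions made for illustration.

```python
from itertools import product


def all_rules(xs, actions):
    """List every non-randomized decision function from the sample space to the action space."""
    return [dict(zip(xs, choice)) for choice in product(actions, repeat=len(xs))]


def action_prob(theta, a, weights, rules, f):
    """P_theta(a | delta): probability that the randomized rule takes action a under theta."""
    return sum(
        p * f[theta][x]
        for p, rule in zip(weights, rules)
        for x in rule
        if rule[x] == a
    )


def integrated_risk(prior, weights, rules, f, loss, actions):
    """r(pi, delta) = sum over (theta, a) of pi(theta) * P_theta(a | delta) * L(theta, a)."""
    return sum(
        prior[t] * action_prob(t, a, weights, rules, f) * loss[(t, a)]
        for t in prior
        for a in actions
    )


xs, actions = [0, 1], ["a0", "a1"]
f = {"t0": {0: 0.8, 1: 0.2}, "t1": {0: 0.3, 1: 0.7}}  # f(x | theta), hypothetical
loss = {("t0", "a0"): 0, ("t0", "a1"): 1, ("t1", "a0"): 1, ("t1", "a1"): 0}
prior = {"t0": 0.5, "t1": 0.5}
rules = all_rules(xs, actions)  # rules[0] always takes a0; rules[1] takes a0 iff x == 0

pure = integrated_risk(prior, [0, 1, 0, 0], rules, f, loss, actions)
mixed = integrated_risk(prior, [0.5, 0.5, 0, 0], rules, f, loss, actions)
print(pure, mixed)
```

The integrated risk is linear in the mixing weights, so the risk of the mixed rule is the average of the risks of the two non-randomized rules it mixes.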

As far as the preference is concerned, it is only the risk function of $\delta$ that
matters. Also $\delta$ appears in the risk function only through $P_\theta(a \mid \delta)$ which, for
each $\theta$, is a probability distribution on the action space. Somewhat trivially,
for each $\theta_0 \in \Theta$, one can also think of it as a probability distribution $q$ on the
space $\mathcal{P}$ of all $(\theta, a)$, $\theta \in \Theta$, $a \in \mathcal{A}$, such that

$$
q(\theta, a) = \begin{cases} P_{\theta_0}(a \mid \delta) & \text{if } \theta = \theta_0; \\ 0 & \text{if } \theta \ne \theta_0. \end{cases}
$$

As in Section 3.2, let the set of probability distributions putting probability
on a finite number of points in $\mathcal{P}$ be $\mathcal{P}^*$. The DM can think of the choice of a
$\delta$ as a somewhat abstract gamble with pay-off $(P_\theta(a_1 \mid \delta), P_\theta(a_2 \mid \delta), \ldots)$ if $\theta$ is
true. This pay-off sits on $(\theta, a_1), (\theta, a_2), \ldots$. Let $\mathcal{G}$ be the set of all gambles of
this form $[p_1, \ldots, p_m]$ where $p_i$ is a probability distribution on $\mathcal{P}$ that is the
pay-off corresponding to the $i$th point $\theta_i$ in $\Theta = \{\theta_1, \theta_2, \ldots, \theta_m\}$. Further, let
$\mathcal{G}^*$ be the set of all probability distributions putting mass on a finite number
of points in $\mathcal{G}$. The DM can embed her $\delta$ in $\mathcal{G}$ and suppose she can extend her
preference relation to $\mathcal{G}$ and $\mathcal{G}^*$. If axioms $H_1$ and $H_2$ of Section 3.2 hold, then
there exists a utility function $u_g$ on $\mathcal{G}^*$ that induces the preference relation
$\preceq_g$ on $\mathcal{G}^*$. We assume the preference relations $\preceq$ on $\mathcal{P}^*$ and $\preceq_g$ on $\mathcal{G}^*$ are
connected as follows, vide Ferguson (1967).

$A_1$ If $p_i \preceq p_i'$, then $[p_1, \ldots, p_i, \ldots, p_m] \preceq_g [p_1, \ldots, p_i', \ldots, p_m]$.

$A_2$ If $p \prec p'$, then $[p, \ldots, p] \prec_g [p', \ldots, p']$.


To proceed further, we need one more assumption, $A_3$ of Ferguson (1967).
If $p_1, \ldots, p_k$ are elements of $\mathcal{P}$ and $\lambda_1, \ldots, \lambda_k$ are non-negative numbers adding
up to 1, let $(\lambda_1 p_1, \ldots, \lambda_k p_k)$ denote the element of $\mathcal{P}^*$ that chooses pay-off $p_i$
with probability $\lambda_i$, $1 \le i \le k$. Then $A_3$ is given by

$A_3$ $(\lambda_1 [p_{11}, \ldots, p_{1m}], \ldots, \lambda_k [p_{k1}, \ldots, p_{km}])
\sim_g [(\lambda_1 p_{11}, \ldots, \lambda_k p_{k1}), \ldots, (\lambda_1 p_{1m}, \ldots, \lambda_k p_{km})]$,

where $\sim_g$ denotes equivalence under the preference relation on $\mathcal{G}^*$.

Then, under these three assumptions, it is shown by Ferguson that $\preceq_g$ is
induced by a prior $\pi(\theta)$ and the loss function $L(\theta, a)$ as indicated in Section 3.1.

The need to extend the preference relation on the space of decision functions
to all pairs of elements of $\mathcal{G}^*$ is somewhat artificial. It is of course true
that in many practical decision problems the space $\mathcal{G}^*$ would occur naturally.
For example, even in a statistical problem, if the loss or utility arising from
the combination $(\theta, a)$ doesn't depend on $\theta$, then the extension to $\mathcal{G}^*$ would
be relatively natural. An illuminating and penetrating discussion of various
sets of axioms leading to existence of utility and prior appears in Chapter 2 
of French and Rios Insua (2000). They also provide references to a huge 
literature and a brief survey. 


3.4 Coherence 


There is an alternative way of justifying a Bayesian approach to decision 
making on the basis of the notion of coherence as modified by Freedman and 
Purves (1969) and Heath and Sudderth (1978). Coherence was originally 
introduced by de Finetti to show any quantification of uncertainty that does 
not satisfy the axioms of a (finitely additive) probability distribution would 
lead to sure loss in suitably chosen gambles. This is treated in Appendix C. 
To return to coherence in the context of decision making, suppose $A$ stands
for a set in the space of $\theta$ and $x$ values, and $A_x = \{\theta : (\theta, x) \in A\}$. Given
$x$, the DM's uncertainty about $A$ is given by $q(x, A_x)$. An MC (master of
ceremonies) chooses a betting system $(A, b)$, where $A$ is as above and $b$ is a
bounded real valued function of $x$. The DM accepts the gamble with pay-off

$$\phi(\theta, x) = b(x) \left[ I_A(\theta, x) - q(x, A_x) \right].$$

She pays $b(x)[1 - q(x, A_x)]$ if $\theta$ lies in $A_x$ and gets $b(x) q(x, A_x)$ otherwise.
The expected pay-off is

$$E(\theta) = \int \phi(\theta, x)\, p(dx \mid \theta).$$

If she accepts $k$ such gambles defined as above by $(A_{(1)}, b_{(1)}), \ldots, (A_{(k)}, b_{(k)})$,
then her expected pay-off is the sum of the $k$ expected pay-offs. She will face
sure loss if

$$\inf_\theta \left( \sum_{i=1}^{k} E_i(\theta) \right) > 0.$$


The idea is that if $q$ reflects her uncertainty about $\theta$, then this combination of
bets is fair and so acceptable to her. However any rational choice of $q$ should
avoid sure loss as defined above. Such a choice is said to be coherent if no
finite combination of acceptable bets can lead to sure loss. The basic result of
Freedman and Purves (1969) and Heath and Sudderth (1978) is that in order
to be coherent, the DM must act like a Bayesian with a (finitely additive) prior
and $q$ must be the resulting posterior. A similar result is proved by Berti et
al. (1991).
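
The sure-loss criterion can be checked directly in a toy setting. The sketch below makes several simplifying assumptions that are not in the text: there is no data $x$, the parameter takes three values, the MC uses constant stakes, and the names `total_payoff` and `sure_loss` are hypothetical. It follows the sign convention of the criterion above, under which a combined pay-off that is positive for every parameter value is a sure loss.

```python
from fractions import Fraction


def total_payoff(theta, bets, q):
    """Sum over accepted bets of b * [I_A(theta) - q(A)], the pay-off in the no-data case."""
    return sum(b * ((1 if theta in A else 0) - q[A]) for A, b in bets)


def sure_loss(thetas, bets, q):
    """True if the combined pay-off is strictly positive for every theta."""
    return min(total_payoff(t, bets, q) for t in thetas) > 0


thetas = [1, 2, 3]
A, A_c = frozenset({1}), frozenset({2, 3})
bets = [(A, -1), (A_c, -1)]  # the MC stakes b = -1 on A and on its complement

incoherent_q = {A: Fraction(3, 5), A_c: Fraction(3, 5)}  # q(A) + q(A^c) > 1
coherent_q = {A: Fraction(3, 5), A_c: Fraction(2, 5)}    # a probability distribution

print(sure_loss(thetas, bets, incoherent_q))
print(sure_loss(thetas, bets, coherent_q))
```

The second check only shows that this particular pair of bets does not produce a sure loss; coherence proper requires that no finite combination of acceptable bets does.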


3.5 Bayesian Analysis with Subjective Prior 


We have already discussed basics of subjective prior Bayesian inference in 
Chapter 2. In the following, we shall concentrate on some issues related to 
robustness of Bayesian inference. The notations used will be mostly as given 
in Chapter 2, but some of those will be recalled and a few additional notations 
will be introduced here as needed. 

Let $\mathcal{X}$ be the sample space and $\Theta$ be the parameter space. As before,
suppose $X$ has (model) density $f(x|\theta)$ and $\theta$ has (prior) probability density
$\pi(\theta)$. Then the joint density of $(X, \theta)$, for $x \in \mathcal{X}$ and $\theta \in \Theta$, is

$$h(x, \theta) = f(x|\theta)\pi(\theta).$$

The marginal density of $X$ corresponding with this joint density is

$$m_\pi(x) = \int_\Theta h(x, \theta)\, d\theta = \int_\Theta f(x|\theta)\, d\pi(\theta).$$

Note that this can be expressed as

$$m_\pi(x) = \begin{cases} \int_\Theta f(x|\theta)\pi(\theta)\, d\theta & \text{if } X \text{ is continuous,} \\ \sum_\Theta f(x|\theta)\pi(\theta) & \text{if } X \text{ is discrete.} \end{cases}$$

Often we shall use $m(x)$ for $m_\pi(x)$, especially if the prior $\pi$ which is being
used is clear from the context. Recall that the posterior density of $\theta$ given $x$
is given by

$$\pi(\theta|x) = \frac{h(x, \theta)}{m_\pi(x)} = \frac{f(x|\theta)\pi(\theta)}{m_\pi(x)}.$$

The posterior mean and posterior variance with respect to prior $\pi$ will be
denoted by $E^\pi(\theta|x)$ and $V^\pi(\theta|x)$, respectively. Similarly, the posterior
probability of a set $A \subset \Theta$ given $x$ will be denoted by $P^\pi(A|x)$.


3.6 Robustness and Sensitivity


Intuitively, robustness means lack of sensitivity of the decision or inference to 
assumptions in the analysis that may involve a certain degree of uncertainty. 
In an inference problem, the assumptions usually involve choice of the model 
and prior, whereas in a decision problem there is the additional assumption 
involving the choice of the loss or utility function. An analysis to measure 
the sensitivity is called sensitivity analysis. Clearly, robustness with respect 
to all three of these components is desirable. That is to say that reasonable 
variations from the choice used in the analysis for the model, prior, and loss 
function do not lead to unreasonable variations in the conclusions arrived 
at. We shall not, however, discuss robustness with respect to model and loss 
function here in any great detail. Instead, we would like to mention that there 
is substantial literature on this and references can be found in sources such 
as Berger (1984, 1985a, 1990, 1994), Berger et al. (1996), Kadane (1984), 
Leamer (1978), Rios Insua and Ruggeri (2000), and Wasserman (1992). 

Because justification from the viewpoint of rational behavior is usually
desired for inferential procedures, we would like to cite the work of Nobel
laureate Kahneman on Bayesian robustness here. In his joint work with Tversky
(see Tversky et al. (1981) and Kahneman et al. (1982)), it was shown in
psychological studies that seemingly inconsequential changes in the formulation
of choice problems caused significant shifts of preference. These
‘inconsistencies’ were traced to all the components of decision making. This
probably means that robustness of inference cannot be taken for granted but
needs to be earned.

The following example illustrates why sensitivity to the choice of prior can 
be an important consideration. 


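Example 3.1. Suppose we observe $X$, which follows Poisson($\theta$) distribution.
Further, it is felt a priori that $\theta$ has a continuous distribution with median 2
and upper quartile 4, i.e., $P^\pi(\theta < 2) = 0.5 = P^\pi(\theta > 2)$ and $P^\pi(\theta > 4) = 0.25$.
If these are the only prior inputs available, the following three are candidates
for such a prior:

(i) $\pi_1$ : $\theta \sim$ exponential($a$) with $a = \log(2)/2$;

(ii) $\pi_2$ : $\log(\theta) \sim N(\log(2), (\log(2)/z_{.25})^2)$; and

(iii) $\pi_3$ : $\log(\theta) \sim$ Cauchy($\log(2), \log(2)$).

Then (i) under $\pi_1$, $\theta|x \sim$ Gamma($a + 1, x + 1$), so that the posterior mean is
$(x + 1)/(a + 1)$;

(ii) under $\pi_2$, if we let $\gamma = \log(\theta)$, and $\tau = \log(2)/z_{.25} = \log(2)/0.675$, we
obtain

$$E^{\pi_2}(\theta|x) = E^{\pi_2}(\exp(\gamma)|x) = \frac{\int_{-\infty}^{\infty} \exp(-e^{\gamma}) \exp(\gamma(x+1)) \exp(-(\gamma - \log(2))^2/(2\tau^2))\, d\gamma}{\int_{-\infty}^{\infty} \exp(-e^{\gamma}) \exp(\gamma x) \exp(-(\gamma - \log(2))^2/(2\tau^2))\, d\gamma};$$

and (iii) under $\pi_3$, again if we let $\gamma = \log(\theta)$, we get

$$E^{\pi_3}(\theta|x) = E^{\pi_3}(\exp(\gamma)|x) = \frac{\int_{-\infty}^{\infty} \exp(-e^{\gamma}) \exp(\gamma(x+1)) \left[1 + \left(\frac{\gamma - \log(2)}{\log(2)}\right)^2\right]^{-1} d\gamma}{\int_{-\infty}^{\infty} \exp(-e^{\gamma}) \exp(\gamma x) \left[1 + \left(\frac{\gamma - \log(2)}{\log(2)}\right)^2\right]^{-1} d\gamma}.$$

To see if the choice of prior matters, simply examine the posterior means
under the three different priors in Table 3.1.


Table 3.1. Posterior Means under $\pi_1$, $\pi_2$, and $\pi_3$

                                          x
$\pi$       0      1      2      3      4      5      10      15      20      50
$\pi_1$   .749  1.485  2.228  2.971  3.713  4.456   8.169  11.882  15.595  37.874
$\pi_2$   .950  1.480  2.106  2.806  3.559  4.353   8.660  13.241  17.945  47.017
$\pi_3$   .761  1.562  2.094  2.633  3.250  3.980   8.867  14.067  19.178  49.402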


For small or moderate x (x < 10), there is robustness: the choice of prior 
does not seem to matter too much. For large values of x, the choice does 
matter. The inference that a conjugate prior obtains then is quite different 
from what a heavier tailed prior would obtain. It is now clear that there are 
situations where it does matter what prior one chooses from a class of priors, 
each of which is considered reasonable given the available prior information. 
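
The comparison in Table 3.1 can be approximated numerically. The following Python sketch is only an illustration, not the computation used for the table: it works on the scale $\gamma = \log(\theta)$, replaces each integral by a Riemann sum on a truncated grid (the grid limits and step are assumptions), and the helper names such as `posterior_mean` are hypothetical.

```python
import math

LOG2 = math.log(2.0)
A = LOG2 / 2.0        # rate of the exponential prior pi_1
TAU = LOG2 / 0.675    # scale of the normal prior on log(theta) under pi_2

# Log prior densities of gamma = log(theta), up to additive constants.
def log_prior_1(g):   # theta ~ exponential(A): density of gamma is A e^g exp(-A e^g)
    return g - A * math.exp(g)

def log_prior_2(g):   # gamma ~ N(log 2, TAU^2)
    return -(g - LOG2) ** 2 / (2.0 * TAU ** 2)

def log_prior_3(g):   # gamma ~ Cauchy(log 2, log 2)
    return -math.log(1.0 + ((g - LOG2) / LOG2) ** 2)

def posterior_mean(log_prior, x, lo=-400.0, hi=8.0, step=0.02):
    """Approximate E(theta | x) for X ~ Poisson(theta) by a Riemann sum in gamma."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    # log(likelihood * prior) as a function of gamma, shifted by its maximum
    logs = [-math.exp(g) + x * g + log_prior(g) for g in grid]
    shift = max(logs)
    weights = [math.exp(v - shift) for v in logs]
    numerator = sum(w * math.exp(g) for w, g in zip(weights, grid))
    return numerator / sum(weights)

if __name__ == "__main__":
    for x in (0, 1, 2, 3, 4, 5, 10, 15, 20, 50):
        row = [posterior_mean(p, x) for p in (log_prior_1, log_prior_2, log_prior_3)]
        print(x, " ".join(f"{v:8.3f}" for v in row))
```

Under $\pi_1$ the sum can be checked against the closed form $(x+1)/(a+1)$, which gives some confidence in the grid before it is used for $\pi_2$ and $\pi_3$.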

The above example indicates that there is no escape from investigating 
prior robustness formally. How does one then reconcile this with the single 
prior Bayesian argument? It is certainly true that if one has a utility/loss 
function and a prior distribution there are compelling reasons for a Bayesian 
analysis using these. However, this assumes the existence of these two enti- 
ties, and so it is of interest to know if one can justify the Bayesian viewpoint 
for statistics without this assumption. Various axiomatic systems for statis- 
tics can be developed (see Fishburn (1981)) involving a preference ordering 
for statistical procedures together with a set of axioms that any ‘coherent’ 
preference ordering must satisfy. Justification for the Bayesian approach then 
follows from the fact that any rational or coherent preference ordering cor- 
responds to a Bayesian preference ordering (see Berger (1985a)). This means 
that there must be a loss function and a prior distribution such that this ax- 
iom system is compatible with the Bayesian approach corresponding to these. 
However, even then there are no compelling reasons to be a die-hard single 
prior Bayesian. The reason is that it is impractical to arrive at a total prefer- 
ence ordering. If we stop short of this and we are only able to come up with 
a partial preference ordering (see Seidenfeld et al. (1995) and Kadane et al. 
(1999)), the result will be a Bayesian analysis (again) using a class of prior 
distributions (and a class of utilities). This is the philosophical justification for
a “robust Bayesian” as noted in Berger’s book (Berger (1985a)). One could,
of course, argue that a second stage of prior on the class $\Gamma$ of possible priors
is the natural solution to arrive at a single prior, but it is not clear how to
arrive at this second stage prior.


3.7 Classes of Priors 


There is a vast literature on how to choose a class $\Gamma$ of priors to model prior
uncertainty appropriately. The goals (see Berger (1994)) are clearly

(i) to ensure that as many ‘reasonable’ priors as possible are included,

(ii) to try to eliminate ‘unreasonable’ priors,

(iii) to ensure that $\Gamma$ does not require prior information which is difficult to
elicit, and

(iv) to be able to compute measures of robustness without much difficulty.

As can be seen, (i) is needed to ensure robustness and (ii) to ensure that
one does not erroneously conclude lack of robustness. The above mentioned are
competing goals and hence can only be given weights which are appropriate in
the given context. The following example from Berger (1994) is illuminating.


Example 3.2. Suppose $\theta$ is a real-valued parameter, prior beliefs about which
indicate that it should have a continuous prior distribution, symmetric about
0 and having the third quartile, $Q_3$, between 1 and 2. Consider, then

$\Gamma_1 = \{N(0, \tau^2),\ 2.19 < \tau^2 < 8.76\}$ and

$\Gamma_2 = \{$ symmetric priors with $1 < Q_3 < 2\ \}$.

Even though $\Gamma_1$ can be appropriate in some cases, it will mostly be considered
“rather small” because it contains only sharp-tailed distributions. On
the other hand, $\Gamma_2$ will typically be “too large,” containing priors, shapes of
some of which will be considered unreasonable. Starting with $\Gamma_2$ and imposing
reasonable constraints such as unimodality on the priors can lead to sensible
classes such as

$$\Gamma_3 = \{\text{unimodal symmetric priors with } 1 < Q_3 < 2\} \supset \Gamma_1.$$

It will be seen that computing measures of robustness is not very difficult for
any of these three classes.


3.7.1 Conjugate Class


The class consisting of conjugate priors (discussed in some detail in Chapter
5) is one of the easiest classes of priors to work with. If $X \sim N(\theta, \sigma^2)$ with
known $\sigma^2$, the conjugate priors for $\theta$ are the normal priors $N(\mu, \tau^2)$. So one
could consider

$$\Gamma_C = \{N(\mu, \tau^2),\ \mu_1 \le \mu \le \mu_2,\ \tau_1^2 \le \tau^2 \le \tau_2^2\}$$

for some specified values of $\mu_1$, $\mu_2$, $\tau_1^2$, and $\tau_2^2$. The advantage with the
conjugate class is that posterior quantities can be calculated in closed form
(for natural conjugate priors). In the above case, if $\theta \sim N(\mu, \tau^2)$, then
$\theta|X = x \sim N(\mu^*(x), \delta^2)$, where $\mu^*(x) = (\tau^2/(\tau^2 + \sigma^2))x + (\sigma^2/(\tau^2 + \sigma^2))\mu$
and $\delta^2 = \tau^2\sigma^2/(\tau^2 + \sigma^2)$. Minimizing and maximizing posterior quantities
then becomes an easy task (see Leamer (1978), Leamer (1982), and Polasek
(1985)). The crucial drawback of the conjugate class is that it is usually “too
small” to provide robustness. Further, tails of these prior densities are similar
to those of the likelihood function, and hence prior moments greatly influence
posterior inferences. Thus, even when the data is in conflict with the specified
prior information the conjugate priors used can have very pronounced effect
(which may be undesirable if data is to be trusted more). Details on this can
be found in Berger (1984, 1985a, 1994). It must be added here that mixtures
of conjugate priors, on the other hand, can provide robust inferences. In
particular, the Student’s $t$ prior, which is a scale mixture of normals, having flat
tails can be a good choice in some cases. We discuss some of these details later
(see Section 3.9).
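
As a small illustration of how easily posterior quantities can be bounded over the conjugate class, the following Python sketch computes the range of the posterior mean $\mu^*(x)$ over a box of $(\mu, \tau^2)$ values. The numerical inputs and the function names are assumptions chosen for the example.

```python
def posterior_mean(x, mu, tau2, sigma2):
    """Posterior mean mu*(x) for X ~ N(theta, sigma2) and theta ~ N(mu, tau2)."""
    w = tau2 / (tau2 + sigma2)
    return w * x + (1.0 - w) * mu

def posterior_mean_range(x, sigma2, mu_lo, mu_hi, tau2_lo, tau2_hi):
    """Range of mu*(x) over mu_lo <= mu <= mu_hi, tau2_lo <= tau2 <= tau2_hi.

    mu*(x) is monotone in mu for fixed tau2 and monotone in tau2 for fixed mu,
    so its extremes over the box are attained at the four corners.
    """
    corners = [posterior_mean(x, m, t, sigma2)
               for m in (mu_lo, mu_hi) for t in (tau2_lo, tau2_hi)]
    return min(corners), max(corners)

low, high = posterior_mean_range(x=3.0, sigma2=1.0, mu_lo=0.0, mu_hi=1.0,
                                 tau2_lo=1.0, tau2_hi=4.0)
print(low, high)
```

With these inputs the posterior mean stays between about 1.5 and 2.6, so the width of the range is driven jointly by the spread allowed in $\mu$ and in the weight $\tau^2/(\tau^2 + \sigma^2)$.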


3.7.2 Neighborhood Class 


If $\pi_0$ is a single elicited prior, then uncertainty in this elicitation can be
modeled using the class

$$\Gamma_N = \{\pi \text{ which are in the neighborhood of } \pi_0\}.$$

A natural and well studied class is the $\epsilon$-contamination class,

$$\Gamma_\epsilon = \{\pi : \pi = (1 - \epsilon)\pi_0 + \epsilon q,\ q \in Q\},$$

$\epsilon$ reflecting the uncertainty in $\pi_0$ and $Q$ specifying the contaminations. Some
choices for $Q$ are: all distributions $q$, all unimodal distributions with mode $\theta_0$,
and all unimodal symmetric distributions with mode $\theta_0$. The $\epsilon$-contamination
class with appropriate choice of $Q$ can provide good robustness as we will see
later.


3.7.3 Density Ratio Class 


Assuming the existence of densities for all the priors in the class, the density
ratio class is defined as

$$\Gamma_{DR} = \{\pi : L(\theta) \le \alpha\pi(\theta) \le U(\theta) \text{ for some } \alpha > 0\}
= \left\{\pi : \frac{L(\theta)}{U(\theta')} \le \frac{\pi(\theta)}{\pi(\theta')} \le \frac{U(\theta)}{L(\theta')} \text{ for all } \theta, \theta'\right\}, \tag{3.7}$$

for specified non-negative functions $L$ and $U$ (see DeRobertis and Hartigan
(1981)). If we take $L \equiv 1$ and $U \equiv c$, then we get

$$\Gamma_{DR} = \left\{\pi : c^{-1} \le \frac{\pi(\theta)}{\pi(\theta')} \le c \text{ for all } \theta, \theta'\right\}.$$

Some other classes have also been studied. For example, the sub-sigma field
class is obtained by defining the prior on a sub-sigma field of sets. See Berger
(1990) for details and references. Because many distributions are determined
by their moments, once the distributional form is specified, sometimes bounds
are specified on their moments to arrive at a class of priors (see Berger (1990)).


3.8 Posterior Robustness: Measures and Techniques 


Measures of sensitivity are needed to examine the robustness of inference
procedures (or decisions) when a class $\Gamma$ of priors is under consideration. In
recent years two types of these measures have been studied: global measures
of sensitivity such as the range of posterior quantities, and local measures
such as the derivatives (in a sense to be made clear later) of these quantities.
Attempts have also been made to derive robust priors and robust procedures
using these measures.


3.8.1 Global Measures of Sensitivity 


Example 3.3. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\theta, \sigma^2)$, with $\sigma^2$ known and
let $\Gamma$ be all $N(0, \tau^2)$, $\tau^2 > 0$, priors for $\theta$. Then the variation in the
posterior mean is simply $(\inf_{\tau^2 > 0} E(\theta|\bar{x}), \sup_{\tau^2 > 0} E(\theta|\bar{x}))$. Because, for fixed $\tau^2$,
$E(\theta|\bar{x}) = (\tau^2/(\tau^2 + \sigma^2/n))\bar{x}$, this range can easily be seen to be $(0, \bar{x})$ or $(\bar{x}, 0)$
according as $\bar{x} > 0$ or $\bar{x} < 0$. If $\bar{x}$ is small in magnitude, this range will be
small. Thus the robustness of the procedure of using posterior mean as the
Bayes estimate of $\theta$ will depend crucially on the magnitude of the observed
value of $\bar{x}$.


As can be seen from the above example, a natural global measure of
sensitivity of the Bayesian quantity to the choice of prior is the range of this
quantity as the prior varies in the class of priors of interest. Further, as explained
in Berger (1990), typically there are three categories of Bayesian quantities of
interest.

(i) Linear functionals of the prior: $\rho(\pi) = \int_\Theta h(\theta)\, \pi(d\theta)$, where $h$ is a given
function.

If $h$ is taken to be the likelihood function $l$, we get an important linear
functional, the marginal density of data, i.e., $m(\pi) = \int_\Theta l(\theta)\, \pi(d\theta)$.

(ii) Ratio of linear functionals of the prior: $\rho(\pi) = \frac{1}{m(\pi)} \int_\Theta h(\theta) l(\theta)\, \pi(d\theta)$ for
some given function $h$.

If we take $h(\theta) = \theta$, $\rho(\pi)$ is the posterior mean. For $h(\theta) = I_C(\theta)$, the
indicator function of the set $C$, we get the posterior probability of $C$.

(iii) Ratio of nonlinear functionals: $\rho(\pi) = \frac{1}{m(\pi)} \int_\Theta h(\theta, \phi(\pi)) l(\theta)\, \pi(d\theta)$ for
some given $h$. For $h(\theta, \phi(\pi)) = (\theta - \mu(\pi))^2$, where $\mu(\pi)$ is the posterior mean,
we get $\rho(\pi) =$ the posterior variance.

Note that extreme values of linear functionals of the prior as it varies in a
class $\Gamma$ are easy to compute if the extreme points of $\Gamma$ can be identified.




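Example 3.4. Suppose $X \sim N(\theta, \sigma^2)$, with $\sigma^2$ known and the class $\Gamma$ of
interest is

$$\Gamma_{SU} = \{\text{all symmetric unimodal distributions with mode } \theta_0\}.$$

Then $\phi$ denoting the standard normal density, $m(\pi) = \int_{-\infty}^{\infty} \frac{1}{\sigma}\phi\left(\frac{x - \theta}{\sigma}\right)\pi(\theta)\, d\theta$.
Note that any unimodal symmetric (about $\theta_0$) density $\pi$ is a mixture of
uniform densities symmetric about $\theta_0$. Thus the extreme points of $\Gamma_{SU}$ are
$U(\theta_0 - r, \theta_0 + r)$ distributions. Therefore,

$$\inf_{\pi \in \Gamma_{SU}} m(\pi) = \inf_{r > 0} \frac{1}{2r} \int_{\theta_0 - r}^{\theta_0 + r} \frac{1}{\sigma}\phi\left(\frac{x - \theta}{\sigma}\right) d\theta
= \inf_{r > 0} \frac{1}{2r}\left[\Phi\left(\frac{\theta_0 + r - x}{\sigma}\right) - \Phi\left(\frac{\theta_0 - r - x}{\sigma}\right)\right], \tag{3.8}$$

$$\sup_{\pi \in \Gamma_{SU}} m(\pi) = \sup_{r > 0} \frac{1}{2r} \int_{\theta_0 - r}^{\theta_0 + r} \frac{1}{\sigma}\phi\left(\frac{x - \theta}{\sigma}\right) d\theta
= \sup_{r > 0} \frac{1}{2r}\left[\Phi\left(\frac{\theta_0 + r - x}{\sigma}\right) - \Phi\left(\frac{\theta_0 - r - x}{\sigma}\right)\right]. \tag{3.9}$$

In empirical Bayes problems (to be discussed later), for example,
maximization of the above kind is needed to select a prior. This is called Type II
maximum likelihood (see Good (1965)).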

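To study ratio-linear functionals the following results from Sivaganesan
and Berger (1989) are useful.

Lemma 3.5. Suppose $C_T$ is a set of probability measures on the real line
given by $C_T = \{\nu_t : t \in T\}$, $T \subset R^d$, and let $C$ be the convex hull of $C_T$.
Further suppose $h_1$ and $h_2$ are real-valued functions defined on $R$ such that
$\int |h_1(x)|\, dF(x) < \infty$ for all $F \in C$, and $K + h_2(x) > 0$ for all $x$ for some
constant $K$. Then, for any $k$,

$$\sup_{F \in C} \frac{k + \int h_1(x)\, dF(x)}{K + \int h_2(x)\, dF(x)} = \sup_{t \in T} \frac{k + \int h_1(x)\, \nu_t(dx)}{K + \int h_2(x)\, \nu_t(dx)}, \tag{3.10}$$

$$\inf_{F \in C} \frac{k + \int h_1(x)\, dF(x)}{K + \int h_2(x)\, dF(x)} = \inf_{t \in T} \frac{k + \int h_1(x)\, \nu_t(dx)}{K + \int h_2(x)\, \nu_t(dx)}. \tag{3.11}$$

Proof. Because $\int h_1(x)\, dF(x) = \int_T \int h_1(x)\, \nu_t(dx)\, \mu(dt)$, for some probability
measure $\mu$ on $T$, interchanging the order of integration gives

$$k + \int h_1(x)\, dF(x) = \int_T \int (k + h_1(x))\, \nu_t(dx)\, \mu(dt)$$

$$= \int_T \left(\frac{\int (k + h_1(x))\, \nu_t(dx)}{\int (K + h_2(x))\, \nu_t(dx)}\right) \int (K + h_2(x))\, \nu_t(dx)\, \mu(dt)$$

$$\le \left(\sup_{t \in T} \frac{\int (k + h_1(x))\, \nu_t(dx)}{\int (K + h_2(x))\, \nu_t(dx)}\right) \left(K + \int h_2(x)\, dF(x)\right).$$

Therefore,

$$\sup_{F \in C} \frac{k + \int h_1(x)\, dF(x)}{K + \int h_2(x)\, dF(x)} \le \sup_{t \in T} \frac{\int (k + h_1(x))\, \nu_t(dx)}{\int (K + h_2(x))\, \nu_t(dx)}.$$

However, because $C \supset C_T$,

$$\sup_{F \in C} \frac{k + \int h_1(x)\, dF(x)}{K + \int h_2(x)\, dF(x)} \ge \sup_{t \in T} \frac{\int (k + h_1(x))\, \nu_t(dx)}{\int (K + h_2(x))\, \nu_t(dx)}.$$

Hence the proof for the supremum, and the proof for the infimum is along the
same lines. □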
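
The content of Lemma 3.5 is that a ratio of linear functionals evaluated at a mixture can never leave the range of the ratios at the mixture components. A minimal numerical illustration in Python follows, for a finite family of five components; the values standing in for $\int h_1\, d\nu_t$ and $\int h_2\, d\nu_t$ are hypothetical random draws, and $k = K = 0$ is assumed.

```python
import random

def ratio(weights, a, b, k=0.0, K=0.0):
    """(k + sum_t w_t a_t) / (K + sum_t w_t b_t) for mixing weights w over the family."""
    num = k + sum(w * a_t for w, a_t in zip(weights, a))
    den = K + sum(w * b_t for w, b_t in zip(weights, b))
    return num / den

random.seed(0)
# a[t] and b[t] play the roles of the integrals of h1 and h2 under nu_t.
# b[t] > 0, so the positivity condition of the lemma holds with K = 0.
a = [random.uniform(-1.0, 1.0) for _ in range(5)]
b = [random.uniform(0.5, 2.0) for _ in range(5)]
component_ratios = [a_t / b_t for a_t, b_t in zip(a, b)]

violations = 0
for _ in range(10000):
    raw = [random.random() for _ in range(5)]
    total = sum(raw)
    w = [r / total for r in raw]      # a random mixing distribution mu on T
    value = ratio(w, a, b)
    if not (min(component_ratios) - 1e-12 <= value <= max(component_ratios) + 1e-12):
        violations += 1
print(violations)
```

No mixture should escape the interval spanned by the component ratios, which is why the search for extremes can be restricted to the generating family.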


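Theorem 3.6. Consider the class $\Gamma_{SU}$ of all symmetric unimodal prior
distributions with mode $\theta_0$. Then it follows that

$$\sup_{\pi \in \Gamma_{SU}} E^\pi(g(\theta)|x) = \sup_{r > 0} \frac{\frac{1}{2r}\int_{\theta_0 - r}^{\theta_0 + r} g(\theta) f(x|\theta)\, d\theta}{\frac{1}{2r}\int_{\theta_0 - r}^{\theta_0 + r} f(x|\theta)\, d\theta}, \tag{3.12}$$

$$\inf_{\pi \in \Gamma_{SU}} E^\pi(g(\theta)|x) = \inf_{r > 0} \frac{\frac{1}{2r}\int_{\theta_0 - r}^{\theta_0 + r} g(\theta) f(x|\theta)\, d\theta}{\frac{1}{2r}\int_{\theta_0 - r}^{\theta_0 + r} f(x|\theta)\, d\theta}. \tag{3.13}$$

Proof. Note that $E^\pi(g(\theta)|x) = \frac{\int g(\theta) f(x|\theta)\, d\pi(\theta)}{\int f(x|\theta)\, d\pi(\theta)}$, where $f(x|\theta)$ is the density
of the data $x$. Now Lemma 3.5 can be applied by recalling that any unimodal
symmetric distribution is a mixture of symmetric uniform distributions. □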


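Example 3.7. Suppose $X|\theta \sim N(\theta, \sigma^2)$ and robustness of the posterior mean
with respect to $\Gamma_{SU}$ is of interest. Then, the range of the posterior mean over this
class can be easily computed using Theorem 3.6. We thus obtain,

$$\sup_{\pi \in \Gamma_{SU}} E^\pi(\theta|x) = \sup_{r > 0} \frac{\frac{1}{2r}\int_{\theta_0 - r}^{\theta_0 + r} \theta\, \frac{1}{\sigma}\phi\left(\frac{\theta - x}{\sigma}\right) d\theta}{\frac{1}{2r}\int_{\theta_0 - r}^{\theta_0 + r} \frac{1}{\sigma}\phi\left(\frac{\theta - x}{\sigma}\right) d\theta}
= x + \sup_{r > 0} \frac{\sigma\left[\phi\left(\frac{\theta_0 - r - x}{\sigma}\right) - \phi\left(\frac{\theta_0 + r - x}{\sigma}\right)\right]}{\Phi\left(\frac{\theta_0 + r - x}{\sigma}\right) - \Phi\left(\frac{\theta_0 - r - x}{\sigma}\right)},$$

$$\inf_{\pi \in \Gamma_{SU}} E^\pi(\theta|x) = x + \inf_{r > 0} \frac{\sigma\left[\phi\left(\frac{\theta_0 - r - x}{\sigma}\right) - \phi\left(\frac{\theta_0 + r - x}{\sigma}\right)\right]}{\Phi\left(\frac{\theta_0 + r - x}{\sigma}\right) - \Phi\left(\frac{\theta_0 - r - x}{\sigma}\right)}.$$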
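
The optimization over $r$ in Example 3.7 is one-dimensional and can be approximated on a grid. The Python sketch below evaluates the posterior mean under each uniform prior $U(\theta_0 - r, \theta_0 + r)$; the values $x = 2$, $\theta_0 = 0$, $\sigma = 1$ and the grid of $r$ values are assumptions made for illustration.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def post_mean_uniform(r, x, theta0, sigma):
    """Posterior mean of theta under the U(theta0 - r, theta0 + r) prior."""
    a = (theta0 - r - x) / sigma
    b = (theta0 + r - x) / sigma
    return x + sigma * (phi(a) - phi(b)) / (Phi(b) - Phi(a))

def post_mean_range(x, theta0, sigma, r_grid):
    values = [post_mean_uniform(r, x, theta0, sigma) for r in r_grid]
    return min(values), max(values)

r_grid = [0.01 * i for i in range(1, 3001)]   # r in (0, 30]
low, high = post_mean_range(x=2.0, theta0=0.0, sigma=1.0, r_grid=r_grid)
print(low, high)
```

For these inputs the computed range runs from close to $\theta_0$ (very concentrated priors) to close to $x$ (very diffuse priors), so the sensitivity of the posterior mean is governed by the distance between $x$ and $\theta_0$.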


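Example 3.8. Suppose $X|\theta \sim N(\theta, \sigma^2)$ and it is of interest to test $H_0 : \theta \le \theta_0$
versus $H_1 : \theta > \theta_0$. Again, suppose that $\Gamma_{SU}$ is the class of priors to be
considered and robustness over this class is to be examined. Because

$$P^\pi(H_0|x) = P^\pi(\theta \le \theta_0|x) = \frac{\int_{-\infty}^{\infty} I_{(-\infty, \theta_0]}(\theta) f(x|\theta)\, d\pi(\theta)}{\int_{-\infty}^{\infty} f(x|\theta)\, d\pi(\theta)},$$

we can apply Theorem 3.6 here as well. We get,

$$\sup_{\pi \in \Gamma_{SU}} P^\pi(H_0|x) = \sup_{r > 0} \frac{\frac{1}{2r}\int_{\theta_0 - r}^{\theta_0} \frac{1}{\sigma}\phi\left(\frac{\theta - x}{\sigma}\right) d\theta}{\frac{1}{2r}\int_{\theta_0 - r}^{\theta_0 + r} \frac{1}{\sigma}\phi\left(\frac{\theta - x}{\sigma}\right) d\theta}
= \sup_{r > 0} \frac{\Phi\left(\frac{\theta_0 - x}{\sigma}\right) - \Phi\left(\frac{\theta_0 - r - x}{\sigma}\right)}{\Phi\left(\frac{\theta_0 + r - x}{\sigma}\right) - \Phi\left(\frac{\theta_0 - r - x}{\sigma}\right)},$$

and similarly,

$$\inf_{\pi \in \Gamma_{SU}} P^\pi(H_0|x) = \inf_{r > 0} \frac{\Phi\left(\frac{\theta_0 - x}{\sigma}\right) - \Phi\left(\frac{\theta_0 - r - x}{\sigma}\right)}{\Phi\left(\frac{\theta_0 + r - x}{\sigma}\right) - \Phi\left(\frac{\theta_0 - r - x}{\sigma}\right)}.$$

It can be seen that, for $x > \theta_0$, the above bounds are, respectively, 0.5 and $\alpha$, where
$\alpha = \Phi\left(\frac{\theta_0 - x}{\sigma}\right)$, the P-value.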
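
The two bounds in Example 3.8 can be checked numerically by scanning over $r$. In the Python sketch below the data value $x = 1.5$, the null boundary $\theta_0 = 0$, $\sigma = 1$ and the grid of $r$ values are assumptions chosen for illustration (with $x > \theta_0$).

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_h0_uniform(r, x, theta0, sigma):
    """P(theta <= theta0 | x) under the U(theta0 - r, theta0 + r) prior."""
    lower = Phi((theta0 - r - x) / sigma)
    middle = Phi((theta0 - x) / sigma)
    upper = Phi((theta0 + r - x) / sigma)
    return (middle - lower) / (upper - lower)

x, theta0, sigma = 1.5, 0.0, 1.0
r_grid = [0.01 * i for i in range(1, 3001)]   # r in (0, 30]
values = [prob_h0_uniform(r, x, theta0, sigma) for r in r_grid]
p_value = Phi((theta0 - x) / sigma)
print(max(values), min(values), p_value)
```

The largest value is approached as $r \to 0$ and the smallest as $r \to \infty$, where the uniform prior is effectively flat and the posterior probability of $H_0$ coincides with the P-value.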





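We shall now consider the density-ratio class that was mentioned earlier
in (3.7) and is given by

$$\Gamma_{DR} = \{\pi : L(\theta) \le \alpha\pi(\theta) \le U(\theta) \text{ for some } \alpha > 0\},$$

for specified non-negative functions $L$ and $U$. For $\pi \in \Gamma_{DR}$ and any real-valued
$\pi$-integrable function $h$ on the parameter space $\Theta$, let $\pi(h) = \int_\Theta h(\theta)\pi(d\theta)$.
Further, let $h = h^+ - h^-$ be the usual decomposition of $h$ into its positive
and negative parts, i.e., $h^+(u) = \max\{h(u), 0\}$ and $h^-(u) = \max\{-h(u), 0\}$.
Then we have the following theorem (see DeRobertis and Hartigan (1981)).


Theorem 3.9. For $U$-integrable functions $h_1$ and $h_2$, with $h_2$ positive a.s.
with respect to all $\pi \in \Gamma_{DR}$,

$$\inf_{\pi \in \Gamma_{DR}} \frac{\pi(h_1)}{\pi(h_2)}$$

is the unique solution $\lambda$ of

$$L(h_1 - \lambda h_2)^+ - U(h_1 - \lambda h_2)^- = 0, \tag{3.14}$$

and

$$\sup_{\pi \in \Gamma_{DR}} \frac{\pi(h_1)}{\pi(h_2)}$$

is the unique solution $\lambda$ of

$$U(h_1 - \lambda h_2)^+ - L(h_1 - \lambda h_2)^- = 0. \tag{3.15}$$

Proof. Let $\lambda_0 = \inf_{\pi \in \Gamma_{DR}} \frac{\pi(h_1)}{\pi(h_2)}$, $c_1 = \inf_{\pi \in \Gamma_{DR}} \pi(h_2)$ and $c_2 = \sup_{\pi \in \Gamma_{DR}} \pi(h_2)$.
Then $0 < c_1 < c_2 < \infty$, and $|\lambda_0| < \infty$. Because $L(h_1 - \lambda h_2)^+ - U(h_1 - \lambda h_2)^-$
is the infimum of $\pi(h_1 - \lambda h_2)$ over measures $\pi$ with $L \le \pi \le U$, for any $\lambda$, note that
$\lambda_0 > \lambda$ if and only if $L(h_1 - \lambda h_2)^+ - U(h_1 - \lambda h_2)^- > 0$. However, $\lambda_0 \ge \lambda$ if and only if
$L(h_1 - \lambda h_2)^+ - U(h_1 - \lambda h_2)^- \ge 0$. A similar argument holds for the supremum. □


Example 3.10. Suppose $X \sim N(\theta, \sigma^2)$, with $\sigma^2$ known. Consider the class $\Gamma_{DR}$
with $L$ being the Lebesgue measure and $U = kL$, $k > 1$. Because the posterior
mean is

$$\frac{\int \theta f(x|\theta)\, d\pi(\theta)}{\int f(x|\theta)\, d\pi(\theta)} = \frac{\pi(\theta f(x|\theta))}{\pi(f(x|\theta))},$$

in the notation of Theorem 3.9, we have that $\inf_{\pi \in \Gamma_{DR}} E^\pi(\theta|x)$ is the unique
solution $\lambda$ of

$$k\int_{-\infty}^{\lambda} (\theta - \lambda) f(x|\theta)\, d\theta + \int_{\lambda}^{\infty} (\theta - \lambda) f(x|\theta)\, d\theta = 0, \tag{3.16}$$

and similarly, $\sup_{\pi \in \Gamma_{DR}} E^\pi(\theta|x)$ is the unique solution $\lambda$ of

$$\int_{-\infty}^{\lambda} (\theta - \lambda) f(x|\theta)\, d\theta + k\int_{\lambda}^{\infty} (\theta - \lambda) f(x|\theta)\, d\theta = 0. \tag{3.17}$$

Noting that $f(x|\theta) = \frac{1}{\sigma}\phi\left(\frac{x - \theta}{\sigma}\right) = \frac{1}{\sigma}\phi\left(\frac{\theta - x}{\sigma}\right)$, and letting $\lambda_1$ be the minimum
and $\lambda_2$ the maximum, the above equations may be rewritten as

$$(k - 1)\left[\frac{\lambda_1 - x}{\sigma}\Phi\left(\frac{\lambda_1 - x}{\sigma}\right) + \phi\left(\frac{\lambda_1 - x}{\sigma}\right)\right] = -\frac{\lambda_1 - x}{\sigma}, \tag{3.18}$$

$$(k - 1)\left[\frac{\lambda_2 - x}{\sigma}\Phi\left(\frac{\lambda_2 - x}{\sigma}\right) + \phi\left(\frac{\lambda_2 - x}{\sigma}\right)\right] = k\,\frac{\lambda_2 - x}{\sigma}. \tag{3.19}$$

Now let $\frac{\lambda_2 - x}{\sigma} = \gamma$. Then $\lambda_2 = x + \sigma\gamma$. Put $\lambda_1 = x - \sigma\gamma$, or $\frac{\lambda_1 - x}{\sigma} = -\gamma$.
Then we see from the second equation above that

$$(k - 1)\left[\frac{\lambda_1 - x}{\sigma}\Phi\left(\frac{\lambda_1 - x}{\sigma}\right) + \phi\left(\frac{\lambda_1 - x}{\sigma}\right)\right] + \frac{\lambda_1 - x}{\sigma}$$

$$= (k - 1)\left[-\gamma\Phi(-\gamma) + \phi(-\gamma)\right] - \gamma$$

$$= (k - 1)\left[-\gamma(1 - \Phi(\gamma)) + \phi(\gamma)\right] - \gamma$$

$$= (k - 1)\left[\gamma\Phi(\gamma) + \phi(\gamma)\right] - k\gamma$$

$$= 0,$$

implying that once $\lambda_2$ is obtained, say $\lambda_2 = x + \sigma\gamma$, the solution for $\lambda_1$ is
simply $x - \sigma\gamma$. Table 3.2 tabulates $\gamma = \gamma(k)$ for various values of $k$. What one
can easily see from this table is that, if, for example, the prior density ratio
between two parameter points is sure to be between 0.5 and 2, the posterior
mean is sure to be within 0.276 standard deviation of $x$, and if instead the
ratio is certain to be between 0.1 and 10, the range is certain to be no more
than 1 s.d. either side.


Table 3.2. Values of $\gamma(k)$ for Some Values of $k$

$k$         | 1 | 1.25  | 1.5   | 2     | 3     | 4     | 5     | 10
$\gamma(k)$ | 0 | 0.089 | 0.162 | 0.276 | 0.436 | 0.549 | 0.636 | 0.901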
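
The values of $\gamma(k)$ can be recomputed by solving the equation $(k - 1)[\gamma\Phi(\gamma) + \phi(\gamma)] = k\gamma$, which is equation (3.19) written in terms of $\gamma = (\lambda_2 - x)/\sigma$. The following Python sketch uses plain bisection; the bracketing interval $[0, 10]$ and the tolerance are assumptions, and `gamma_of_k` is a hypothetical helper name.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gamma_of_k(k, tol=1e-10):
    """Solve (k - 1) * (g * Phi(g) + phi(g)) = k * g for g >= 0 by bisection."""
    def eq(g):
        return (k - 1.0) * (g * Phi(g) + phi(g)) - k * g
    # eq is strictly decreasing in g, with eq(0) >= 0 and eq(10) < 0 for k >= 1.
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eq(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for k in (1, 1.25, 1.5, 2, 3, 4, 5, 10):
    print(k, round(gamma_of_k(k), 3))
```

The resulting half-width $\sigma\gamma(k)$ grows slowly with $k$, which is the quantitative content of the remark that even a density ratio bound of 10 keeps the posterior mean within one standard deviation of $x$.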


3.8.2 Belief Functions 


An entirely different approach to global Bayesian robustness is available, and 
this is through belief functions and plausibility functions. This originated with 
the introduction of upper and lower probabilities by Dempster (1967, 1968) 
but further evolved in various directions as can be seen from Shafer (1976, 
1979), Wasserman (1990), Wasserman and Kadane (1990), and Walley (1991). 
The terminology of infinitely alternating Choquet capacity is also used in the 
literature. Imprecise probability is a generic term used in this context, which 
includes fuzzy logic as well as upper and lower previsions. 

Recall that robust Bayesian inference uses a class of plausible prior proba- 
bility measures. It turns out that associated with a belief function is a convex 
set of probability measures, of which the belief function is a lower bound, and 
the plausibility function an upper bound. Thus a belief function and a plausi- 
bility function can naturally be used to construct a class of prior probability 
distributions. Some specific details are given below skipping technical details 
and some generality. 

Suppose the parameter space $\Theta$ is a Euclidean space and $D$ is a convex,
compact subset of a Euclidean space. Let $\mu$ be a probability measure on $D$
and $T$ be a map taking points in $D$ to nonempty closed subsets of $\Theta$. Then
for each $A \subset \Theta$, define

$$A_* = \{d \in D : T(d) \subset A\}, \text{ and}$$
$$A^* = \{d \in D : T(d) \cap A \ne \emptyset\}.$$

Define Bel and Pl on $\Theta$ by

$$\mathrm{Bel}(A) = \mu(A_*) \text{ and } \mathrm{Pl}(A) = \mu(A^*). \tag{3.20}$$

Then Bel is called a belief function and Pl, a plausibility function with source
$(D, \mu, T)$. Note that $0 \le \mathrm{Bel}(A) \le \mathrm{Pl}(A) \le 1$, $\mathrm{Bel}(A) = 1 - \mathrm{Pl}(A^c)$ for any
$A$, and $\mathrm{Bel}(\Theta) = \mathrm{Pl}(\Theta) = 1$, $\mathrm{Bel}(\emptyset) = \mathrm{Pl}(\emptyset) = 0$. The above definition may
be given the following meaning. If evidence comes from a random draw from
$D$, then Bel($A$) may be interpreted to be the probability that this evidence
implies $A$ is true, whereas Pl($A$) can be thought of as the probability that
this evidence is consistent with $A$ being true. It can be checked that Bel is
a probability measure iff $\mathrm{Bel}(A) = \mathrm{Pl}(A)$ for all $A$, or equivalently, $T(d)$ is
almost surely a singleton set.


Example 3.11. Suppose it is known that the true value of $\theta$ lies in a fixed set
$\Theta_0 \subset \Theta$. Set $T(d) = \Theta_0$ for all $d \in D$. Then $\mathrm{Bel}(A) = 1$ if $\Theta_0 \subset A$; $\mathrm{Bel}(A) = 0$
otherwise.


Example 3.12. Suppose $P$ is a probability measure on $\Theta$. Then $P$ is also a
belief function with source $(\Theta, P, T)$, where $T(\theta) = \{\theta\}$.


A probability measure $P$ is said to be compatible with Bel and Pl if for
each $A$, $\mathrm{Bel}(A) \le P(A) \le \mathrm{Pl}(A)$. Let $C$ be the set of all probability measures
compatible with Bel and Pl. Then $C \ne \emptyset$ and for each $A$,

$$\mathrm{Bel}(A) = \inf_{P \in C} P(A) \text{ and } \mathrm{Pl}(A) = \sup_{P \in C} P(A).$$

This indicates that we can use Bel and Pl to construct prior envelopes. In
particular, if Bel and Pl arise from any available partial prior information,
then the set of compatible probability measures, $C$, is exactly the class of prior
distributions that a robust Bayesian analysis requires (compare with (3.4)).

Let $h : \Theta \to R$ be any bounded, measurable function. Define its upper
and lower expectations by

$$E^*(h) = \sup_{P \in C} E_P(h) \text{ and } E_*(h) = \inf_{P \in C} E_P(h), \tag{3.21}$$

where $E_P(h) = \int_\Theta h(\theta)\, P(d\theta)$. If we let

$$h^*(d) = \sup_{\theta \in T(d)} h(\theta) \text{ and } h_*(d) = \inf_{\theta \in T(d)} h(\theta),$$

then it can be shown that

$$E^*(h) = \int_D h^*(u)\, \mu(du) \text{ and } E_*(h) = \int_D h_*(u)\, \mu(du). \tag{3.22}$$

Details on these may be found in Wasserman (1990). Based on these ideas,
some new techniques for Bayesian robustness measures can be derived when
the prior envelopes arise from belief functions.
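
To make the definitions (3.20) and (3.22) concrete, here is a toy Python sketch in which both $D$ and $\Theta$ are finite sets. This departs from the Euclidean setting of the text and is purely an assumption for illustration; the source $(D, \mu, T)$ and all names in the code are hypothetical.

```python
def bel(A, mu, T):
    """Bel(A): total mass of the points d whose image T(d) is contained in A."""
    return sum(p for d, p in mu.items() if T[d] <= A)

def pl(A, mu, T):
    """Pl(A): total mass of the points d whose image T(d) meets A."""
    return sum(p for d, p in mu.items() if T[d] & A)

def upper_expectation(h, mu, T):
    """E^*(h): integral over D of the supremum of h on T(d)."""
    return sum(p * max(h(t) for t in T[d]) for d, p in mu.items())

def lower_expectation(h, mu, T):
    """E_*(h): integral over D of the infimum of h on T(d)."""
    return sum(p * min(h(t) for t in T[d]) for d, p in mu.items())

theta_space = frozenset({0, 1, 2, 3})
mu = {"a": 0.5, "b": 0.3, "c": 0.2}                      # probability measure on D
T = {"a": frozenset({0}), "b": frozenset({0, 1}), "c": frozenset({1, 2, 3})}

A = frozenset({0, 1})
print(bel(A, mu, T), pl(A, mu, T))
print(lower_expectation(lambda t: t, mu, T), upper_expectation(lambda t: t, mu, T))
```

In this toy source the point "c" is compatible with $A$ without implying it, which is exactly what opens the gap between Bel($A$) and Pl($A$).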

Suppose Bel is a belief function on $\Theta$ with source $(D, \mu, T)$ and $C$ is the
class of all prior probability measures compatible with Bel. Let $L(\theta) = f(x|\theta)$
be the likelihood function of $\theta$ given the data $x$, and let $L_A(\theta) = L(\theta)I_A(\theta)$,
where $I_A$ is the indicator function of $A \subset \Theta$. Then we have the following result
and its application from Wasserman (1990).


Theorem 3.13. If $L(\theta)$ is bounded and $A \subset \Theta$, then

$$\inf_{\pi \in C} \pi(A|x) = \frac{E_*(L_A)}{E_*(L_A) + E^*(L_{A^c})}, \tag{3.23}$$

$$\sup_{\pi \in C} \pi(A|x) = \frac{E^*(L_A)}{E^*(L_A) + E_*(L_{A^c})}. \tag{3.24}$$


Example 3.14. Consider the class of $\epsilon$-contamination priors:

$$C = \{\pi : \pi = (1 - \epsilon)\pi_0 + \epsilon q,\ q \in Q\},$$

where $Q$ is the class of all probability measures on $\Theta$. This neighborhood class
$C$ corresponds to the belief function with source $(D, \mu, T)$, where $D = \Theta \cup \{d_0\}$,
$\mu = (1 - \epsilon)\tilde{\pi}_0 + \epsilon\delta$, and

$$T(d) = \begin{cases} \{d\} & \text{if } d \in \Theta; \\ \Theta & \text{if } d = d_0. \end{cases}$$

Here $\delta$ is a point mass on $d_0$ and $\tilde{\pi}_0$ is a probability measure on $D$ giving zero
probability to $d_0$ and is identical to $\pi_0$ on $D - \{d_0\}$. Then from Theorem 3.13
above,

$$\sup_{\pi \in C} \pi(A|x) = \frac{(1 - \epsilon)\int_A L(\theta)\pi_0(d\theta) + \epsilon \sup_{\theta \in A} L(\theta)}{(1 - \epsilon)\int_\Theta L(\theta)\pi_0(d\theta) + \epsilon \sup_{\theta \in A} L(\theta)}, \tag{3.25}$$

$$\inf_{\pi \in C} \pi(A|x) = \frac{(1 - \epsilon)\int_A L(\theta)\pi_0(d\theta)}{(1 - \epsilon)\int_\Theta L(\theta)\pi_0(d\theta) + \epsilon \sup_{\theta \in A^c} L(\theta)}. \tag{3.26}$$

It may be noted that this is a different proof for the same result of Berger and
Berliner (1986).
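
The bounds for the $\epsilon$-contamination class with arbitrary contaminations are simple to evaluate once the base-prior integrals are available. The Python sketch below assumes, for illustration only, that $X|\theta \sim N(\theta, 1)$, that the base prior $\pi_0$ is $N(0, \tau^2)$, and that $A = (-\infty, \theta_0]$; none of these choices comes from the text.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def contamination_bounds(x, eps, tau2, theta0=0.0):
    """Lower bound, base value and upper bound of P(theta <= theta0 | x).

    Assumed setup: X | theta ~ N(theta, 1), pi_0 = N(0, tau2),
    A = (-inf, theta0], Q = all probability measures.
    """
    m0 = phi(x / math.sqrt(1.0 + tau2)) / math.sqrt(1.0 + tau2)  # marginal under pi_0
    post_mean = tau2 * x / (1.0 + tau2)
    post_sd = math.sqrt(tau2 / (1.0 + tau2))
    p0 = Phi((theta0 - post_mean) / post_sd)   # posterior probability of A under pi_0
    int_A = m0 * p0                            # integral of L * pi_0 over A
    sup_A = phi(max(x - theta0, 0.0))          # sup of L(theta) = phi(x - theta) over A
    sup_Ac = phi(max(theta0 - x, 0.0))         # sup of L(theta) over the complement of A
    upper = ((1 - eps) * int_A + eps * sup_A) / ((1 - eps) * m0 + eps * sup_A)
    lower = (1 - eps) * int_A / ((1 - eps) * m0 + eps * sup_Ac)
    return lower, p0, upper

print(contamination_bounds(x=1.0, eps=0.1, tau2=2.0))
```

Setting `eps=0` collapses both bounds to the posterior probability under $\pi_0$, and increasing `eps` widens the interval around it.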


3.8.3 Interactive Robust Bayesian Analysis 


Following Berger (1994), an interactive scheme for robust Bayesian analysis
can be suggested according to the diagram in Figure 3.1. The point to note is
that, if lack of robustness is evident, then the class $\Gamma$ of priors obtained from
initial prior inputs has to be shrunk using further prior elicitation. Details on
such an approach for shrinking a large quantile class of priors are described in
Liseo et al. (1996).


[Figure 3.1: flow diagram with boxes labeled “Initial Prior Inputs”, “Sensitivity Analysis”, and “Further Prior Inputs”.]

Fig. 3.1. Interactive robust Bayesian scheme.


3.8.4 Other Global Measures


As seen earlier, interpretation of the size of the range of posterior quantities 
needs to be done within the given context. However, some efforts have been 
made to derive certain generic measures also. Ruggeri and Sivaganesan (2000) 
suggest a scaled version of the range for this purpose. Suppose $\pi_0$ is a baseline
prior, and let the sensitivity of the posterior mean of a target quantity $h(\theta)$ to
deviations from $\pi_0$ be of interest. Let $\Gamma$ be a class of plausible priors $\pi$ on $\theta$.
Assume the following notation: $\rho^\pi(x) = E^\pi(h(\theta)|x)$, $\rho^0(x) = E^{\pi_0}(h(\theta)|x)$,
and $V^\pi(x)$ denoting the posterior variance of $h(\theta)$ under prior $\pi$. Then the
relative sensitivity, denoted by $R_\pi$, is defined as

$$R_\pi(x) = \frac{(\rho^\pi(x) - \rho^0(x))^2}{V^\pi(x)}. \tag{3.27}$$

The motivation for considering $R_\pi$ is that the posterior variance $V^\pi$ is a
measure of accuracy in estimation of $h(\theta)$, and hence if the squared distance of
$\rho^\pi(x)$ from $\rho^0(x)$ relative to this is not too large, robustness can be expected.
The following example which is essentially from Ruggeri and Sivaganesan
(2000) illustrates this idea.


Example 3.15. Let $X$ have the $N(\theta, 1)$ distribution, and under $\pi_0$, let $\theta$ be
$N(0, 2)$. Consider the class $\Gamma$ of all $N(0, \tau^2)$ priors with $1 < \tau^2 < 10$. Consider
sensitivity of posterior inferences about $h(\theta) = \theta$ when $x > 0$ is observed.
Because the posterior distribution (under the prior $N(0, \tau^2)$) of $\theta$ given $x$ is
normal with mean $\tau^2 x/(\tau^2 + 1)$ and variance $\tau^2/(\tau^2 + 1)$, note that

$$R_\pi(x) = \left(\frac{\tau^2 x}{\tau^2 + 1} - \frac{2x}{3}\right)^2 \frac{\tau^2 + 1}{\tau^2} = \frac{(\tau^2 - 2)^2 x^2}{9\tau^2(\tau^2 + 1)}.$$

It can then be easily checked that $\rho^\pi(x) - \rho^0(x)$ ranges from $-x/6$ to $8x/33$ and
$\sup R_\pi(x) = 6.4x^2/99$. Thus, robustness can be expected when the observation
$x$ lies in the range $0 < x < 4$, but certainly not when $x = 10$.
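
The supremum of the relative sensitivity in Example 3.15 can be confirmed by a direct grid search over $\tau^2$. The Python sketch below is a check under the assumptions of the example; the grid resolution is arbitrary and the function names are hypothetical.

```python
def relative_sensitivity(x, tau2):
    """R_pi(x) for X ~ N(theta, 1), pi = N(0, tau2), baseline N(0, 2), h(theta) = theta."""
    rho_pi = tau2 * x / (tau2 + 1.0)     # posterior mean under pi
    rho_0 = 2.0 * x / 3.0                # posterior mean under the baseline prior
    var_pi = tau2 / (tau2 + 1.0)         # posterior variance under pi
    return (rho_pi - rho_0) ** 2 / var_pi

def sup_relative_sensitivity(x, lo=1.0, hi=10.0, n=9000):
    """Maximum of R_pi(x) over a grid of tau2 values in [lo, hi]."""
    grid = [lo + (hi - lo) * i / n for i in range(n + 1)]
    return max(relative_sensitivity(x, t) for t in grid)

for x in (1.0, 4.0, 10.0):
    print(x, sup_relative_sensitivity(x), 6.4 * x ** 2 / 99.0)
```

The maximum is attained at the upper end $\tau^2 = 10$; at $x = 4$ it is about 1, so the squared shift in the posterior mean is comparable to the posterior variance, while at $x = 10$ it is more than six times larger.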


3.8.5 Local Measures of Sensitivity 


As can be noted from the previous section, unless the class $\Gamma$ of possible
priors is a ‘nice’ parametric class, or a class whose set of extreme points is
easy to work with, computational complexity of global measures of robustness
is high. Furthermore, this ‘global’ approach can become quite infeasible for
very complicated models. If, for example, $X \sim P_\theta$, and $\theta$ is $p$-dimensional,
$p > 1$, then the range of posterior mean of $\theta_i$ may well depend on prior
inputs on $\theta_j$ for $j \ne i$ also. If such is the case, global measures of robustness
will involve computing ranges of posterior quantities of general functions $g(\theta)$
over classes of joint prior distributions of $\theta$.

The alternative, which has attracted a lot of attention in recent years, is
that of trying to study the effects of small perturbations to the prior. This is
called local sensitivity. In this approach also, one may either study the
sensitivity of the entire posterior distribution or that of some specified posterior
quantity. Let us first consider the former as in Gustafson and Wasserman
(1995). A different set of notations as given below are needed in this section.
Let $\pi$ be a prior probability measure and let $\pi^x$ denote its corresponding
posterior probability measure given the data $x$, i.e., $\pi^x(d\theta) = f(x|\theta)\pi(d\theta)/m_\pi(x)$
where $m_\pi(x) = \int_\Theta f(x|\theta)\pi(d\theta)$ is the marginal density of the data. Let $\mathcal{P}$ be
the set of all probability measures on the probability space $(\Theta, \mathcal{B})$. A distance
function $d : \mathcal{P} \times \mathcal{P} \to R$ is needed to quantify changes in prior and posterior
measures. Let $\nu_\epsilon$ be a perturbation of $\pi$ in the direction of a measure $\nu$. Then the
local sensitivity of $\mathcal{P}$ in the direction of $\nu$ can be defined (see Gustafson and
Wasserman (1995)) by

$$s(\pi, \nu; x) = \lim_{\epsilon \downarrow 0} \frac{d(\pi^x, \nu_\epsilon^x)}{d(\pi, \nu_\epsilon)}. \tag{3.28}$$

Two different types of perturbations $\nu_\epsilon$ have been considered. The linear
perturbation is defined as $\nu_\epsilon = (1 - \epsilon)\pi + \epsilon\nu$, and the geometric perturbation as
$d\nu_\epsilon \propto \left(\frac{d\nu}{d\pi}\right)^\epsilon d\pi$. (See Gelfand and Dey (1991) for details.) The local sensitivity
$s(\pi, \nu; x)$ is simply the rate at which the perturbed posterior $\nu_\epsilon^x$ tends to the
‘initial’ posterior $\pi^x$ relative to the change in the prior. As a measure of overall
sensitivity of a class $\Gamma$ of priors one may take

$$s(\pi, \Gamma; x) = \sup_{\nu \in \Gamma} s(\pi, \nu; x).$$

There are many possible choices for $d$, the distance measure.

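(i) $d_{TV}(\pi, \nu) = \sup_{A \in \mathcal{B}} |\pi(A) - \nu(A)|$, the total variation distance. In this
case $s(\pi, \nu; x)$ for linear perturbations turns out to be the norm of the Fréchet
derivative. To see this one needs to start with the Gateaux differential of the
posterior. To define the Gateaux differential, let $\delta = \pi - \nu$, $\|\delta\| = d_{TV}(\pi, \nu)$
and define $T : \mathcal{P} \to \mathcal{P}$ by $T(\pi) = \pi^x$. The norm of the Gateaux differential of $T$ is then

$$\|\dot{T}_\pi(\delta)\| = \lim_{\epsilon \downarrow 0} \frac{d_{TV}(\pi^x, \nu_\epsilon^x)}{\epsilon} = \frac{m_\nu(x)}{m_\pi(x)}\, d_{TV}(\pi^x, \nu^x),$$

because

$$\nu_\epsilon^x = (1 - \lambda)\pi^x + \lambda\nu^x, \tag{3.29}$$

where $\lambda = \lambda(\epsilon) = \epsilon m_\nu(x)/\{(1 - \epsilon)m_\pi(x) + \epsilon m_\nu(x)\}$. Also, simply note that
$d_{TV}(\pi^x, \nu_\epsilon^x) = \lambda(\epsilon)\, d_{TV}(\pi^x, \nu^x)$. Further, if the likelihood function $f(x|\theta)$ is
bounded (in $\theta$), then $\dot{T}_\pi(\delta)$ is a linear map on signed measures such that

$$T(\pi + \delta) = T(\pi) + \dot{T}_\pi(\delta) + o(\|\delta\|), \text{ as } \|\delta\| \to 0,$$

uniformly over all signed measures $\delta$ with mass 0 (see Diaconis and Freedman
(1986)). Note then that

$$s(\pi, \Gamma; x) = \sup_{\nu \in \Gamma} \frac{m_\nu(x)}{m_\pi(x)} \cdot \frac{d_{TV}(\pi^x, \nu^x)}{d_{TV}(\pi, \nu)}.$$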
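
The mixture identity (3.29) and the resulting limit for linear perturbations under total variation can be checked on a finite parameter space. In the Python sketch below the three-point space, the priors `pi` and `nu`, and the likelihood values are hypothetical numbers chosen for illustration.

```python
def posterior(prior, lik):
    """Posterior probabilities on a finite parameter space."""
    w = [p * l for p, l in zip(prior, lik)]
    s = sum(w)
    return [a / s for a in w]

def marginal(prior, lik):
    """Marginal density of the data under the given prior."""
    return sum(p * l for p, l in zip(prior, lik))

def tv(p, q):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

pi = [0.5, 0.3, 0.2]     # base prior
nu = [0.1, 0.3, 0.6]     # direction of the perturbation
lik = [0.2, 0.5, 0.9]    # f(x | theta) at the observed x

eps = 1e-6
nu_eps = [(1 - eps) * a + eps * b for a, b in zip(pi, nu)]     # linear perturbation
m_pi, m_nu = marginal(pi, lik), marginal(nu, lik)
lam = eps * m_nu / ((1 - eps) * m_pi + eps * m_nu)

post_pi, post_nu = posterior(pi, lik), posterior(nu, lik)
post_eps = posterior(nu_eps, lik)
mixture = [(1 - lam) * a + lam * b for a, b in zip(post_pi, post_nu)]  # right side of (3.29)

ratio = tv(post_pi, post_eps) / tv(pi, nu_eps)
limit = (m_nu / m_pi) * tv(post_pi, post_nu) / tv(pi, nu)
print(max(abs(a - b) for a, b in zip(post_eps, mixture)), ratio, limit)
```

The first printed number should be at rounding-error level, confirming (3.29), and the ratio of distances for small `eps` should agree closely with the limiting value.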





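(ii) $d_\phi(\pi, \nu) = \int \phi\left(\frac{\pi(\theta)}{\nu(\theta)}\right) d\nu(\theta)$, where $\phi$ is a smooth convex function with
bounded first and second derivatives near 1 and such that $\phi(1) = 0$. This is
the $\phi$-divergence measure of distance. (See Csiszár (1978), Goel (1983) and
Goel (1986).) Several well-known divergence measures are special cases of
$\phi$-divergence measure for different convex functions. Listed in Table 3.3 are some
such $\phi$ functions and the corresponding divergence measures obtained thereof.
(See Rao (1982) for applications of many of these measures in statistics.)

Consider first the $\epsilon$-contamination class of priors (or linear perturbations),
and note that

$$s(\pi, \nu; x) = \lim_{\epsilon \downarrow 0} \frac{d_\phi(\pi^x, \nu_\epsilon^x)}{d_\phi(\pi, \nu_\epsilon)}.$$

Because $d_\phi(P, Q) = \int \phi\left(\frac{dP}{dQ}\right) dQ$, both $d_\phi(\pi, \nu_\epsilon)$ and $d_\phi(\pi^x, \nu_\epsilon^x)$ converge to 0
as $\epsilon \to 0$. In fact, we shall see that $\frac{d}{d\epsilon}d_\phi(\pi, \nu_\epsilon)$ and $\frac{d}{d\epsilon}d_\phi(\pi^x, \nu_\epsilon^x)$ also converge
to 0 as $\epsilon \to 0$, so that on applying L’Hospital’s rule, we obtain

$$s(\pi, \nu; x) = \lim_{\epsilon \downarrow 0} \frac{\frac{d}{d\epsilon}d_\phi(\pi^x, \nu_\epsilon^x)}{\frac{d}{d\epsilon}d_\phi(\pi, \nu_\epsilon)}
= \lim_{\epsilon \downarrow 0} \frac{\frac{d^2}{d\epsilon^2}d_\phi(\pi^x, \nu_\epsilon^x)}{\frac{d^2}{d\epsilon^2}d_\phi(\pi, \nu_\epsilon)}. \tag{3.30}$$

The following theorem then follows from Theorem 3.1 of Dey and Birmiwal
(1994).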


Table 3.3. $\phi$ Functions and the Corresponding Divergence Measures

$\phi(x)$                                                      | Divergence Measure
$x \log(x)$                                                    | Kullback-Leibler
$-\log(x)$                                                     | Directed divergence
$(x - 1)\log(x)$                                               | J-divergence
$\frac{1}{2}(\sqrt{x} - 1)^2$                                  | Hellinger distance or Kolmogorov's measure of distance
$1 - x^\alpha$, $0 < \alpha < 1$                               | Generalized Bhattacharya measure
$(x - 1)^2$                                                    | Chi-squared divergence or Kagan's measure of distance
$\frac{x^\lambda - 1}{\lambda(\lambda + 1)}$, $\lambda \neq 0, -1$ | Power-weighted divergence
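
As a small illustration of how the entries of Table 3.3 are used, the following Python sketch evaluates several $\phi$-divergences for two discrete distributions, using the form $d_\phi(P,Q) = \sum_i \phi(p_i/q_i)\, q_i$. The probability vectors `p` and `q` and all function names are hypothetical and chosen only for this example.

```python
import math

# phi functions from Table 3.3; each satisfies phi(1) = 0
PHI = {
    "kullback_leibler": lambda x: x * math.log(x),
    "directed_divergence": lambda x: -math.log(x),
    "j_divergence": lambda x: (x - 1.0) * math.log(x),
    "hellinger": lambda x: 0.5 * (math.sqrt(x) - 1.0) ** 2,
    "chi_squared": lambda x: (x - 1.0) ** 2,
}


def phi_divergence(p, q, phi):
    """d_phi(P, Q) = sum_i phi(p_i / q_i) * q_i for strictly positive pmfs."""
    return sum(phi(pi / qi) * qi for pi, qi in zip(p, q))


# Hypothetical distributions on three points
p = [0.2, 0.5, 0.3]
q = [0.25, 0.25, 0.5]
divergences = {name: phi_divergence(p, q, f) for name, f in PHI.items()}
```

Because each $\phi$ is convex with $\phi(1) = 0$, every entry of `divergences` is nonnegative and vanishes when `p` equals `q`.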
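Theorem 3.16. Suppose that $\int \frac{\nu^2(\theta)}{\pi(\theta)}\, d\theta < \infty$. Then

$$s(\pi,\nu;x) = \frac{V_{\pi^x}\left(\frac{\nu(\theta)}{\pi(\theta)}\right)}{V_\pi\left(\frac{\nu(\theta)}{\pi(\theta)}\right)}, \qquad (3.31)$$

where $V_\pi(h(\theta)) = \int h^2(\theta)\, d\pi(\theta) - \left(\int h(\theta)\, d\pi(\theta)\right)^2$.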
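Proof. In view of (3.30) above, it is enough to establish that

$$\frac{d^2}{d\epsilon^2} d_\phi(\pi^x, \nu_\epsilon^x)\Big|_{\epsilon=0} = \phi''(1)\, V_{\pi^x}\left(\frac{\nu(\theta)}{\pi(\theta)}\right). \qquad (3.32)$$

Recall from (3.29) that $\nu_\epsilon^x = h(\epsilon)\pi^x + (1 - h(\epsilon))\nu^x$, where
$h(\epsilon) = (1 - \epsilon)m_\pi(x)/m_{\nu_\epsilon}(x) = (1 - \epsilon)m_\pi(x)/\{(1 - \epsilon)m_\pi(x) + \epsilon m_\nu(x)\}$. Now let
$y = y_\epsilon(\theta, x) = \nu_\epsilon^x(\theta)/\pi^x(\theta)$, and note that

$$y = \frac{h(\epsilon)\pi^x(\theta) + (1 - h(\epsilon))\nu^x(\theta)}{\pi^x(\theta)} = h(\epsilon)\left(1 - \frac{\nu^x(\theta)}{\pi^x(\theta)}\right) + \frac{\nu^x(\theta)}{\pi^x(\theta)}.$$

Therefore,

$$\frac{dy}{d\epsilon} = \left(\frac{d}{d\epsilon} h(\epsilon)\right)\left(1 - \frac{\nu^x(\theta)}{\pi^x(\theta)}\right)
= -\frac{m_\pi(x)\, m_{\nu_\epsilon}(x) + (1 - \epsilon)m_\pi(x)\,(m_\nu(x) - m_\pi(x))}{m_{\nu_\epsilon}^2(x)}\left(1 - \frac{\nu^x(\theta)}{\pi^x(\theta)}\right)
= -\frac{m_\pi(x)\, m_\nu(x)}{m_{\nu_\epsilon}^2(x)}\left(1 - \frac{\nu^x(\theta)}{\pi^x(\theta)}\right),$$

and hence

$$\frac{dy}{d\epsilon}\Big|_{\epsilon=0} = -\frac{m_\nu(x)}{m_\pi(x)}\left(1 - \frac{\nu^x(\theta)}{\pi^x(\theta)}\right),$$

and similarly,

$$\frac{d^2y}{d\epsilon^2}\Big|_{\epsilon=0} = \frac{2\, m_\nu(x)\,(m_\nu(x) - m_\pi(x))}{m_\pi^2(x)}\left(1 - \frac{\nu^x(\theta)}{\pi^x(\theta)}\right).$$

Now because

$$d_\phi(\pi^x, \nu_\epsilon^x) = \int \phi\left(\frac{\nu_\epsilon^x(\theta)}{\pi^x(\theta)}\right)\pi^x(\theta)\, d\theta = \int \phi(y_\epsilon(\theta, x))\, \pi^x(\theta)\, d\theta,$$

and because $\phi$ is a smooth function with bounded first and second derivatives
near 1, applying the dominated convergence theorem (DCT), one obtains

$$\frac{d}{d\epsilon} d_\phi(\pi^x, \nu_\epsilon^x) = \int \phi'(y_\epsilon)\, \frac{dy_\epsilon}{d\epsilon}\, \pi^x(\theta)\, d\theta, \quad \text{and}$$

$$\frac{d}{d\epsilon} d_\phi(\pi^x, \nu_\epsilon^x)\Big|_{\epsilon=0} = -\phi'(1)\, \frac{m_\nu(x)}{m_\pi(x)} \int \left(1 - \frac{\nu^x(\theta)}{\pi^x(\theta)}\right)\pi^x(\theta)\, d\theta = 0.$$

Further, noting that $\frac{d^2}{d\epsilon^2} d_\phi(\pi^x, \nu_\epsilon^x) = \frac{d}{d\epsilon} \int \phi'(y_\epsilon)\, \frac{dy_\epsilon}{d\epsilon}\, \pi^x(\theta)\, d\theta$, and applying
DCT once again, one obtains

$$\frac{d^2}{d\epsilon^2} d_\phi(\pi^x, \nu_\epsilon^x) = \int \left\{\phi''(y_\epsilon)\left(\frac{dy_\epsilon}{d\epsilon}\right)^2 + \phi'(y_\epsilon)\, \frac{d^2y_\epsilon}{d\epsilon^2}\right\}\pi^x(\theta)\, d\theta, \quad \text{and}$$

$$\frac{d^2}{d\epsilon^2} d_\phi(\pi^x, \nu_\epsilon^x)\Big|_{\epsilon=0} = \phi''(1)\left(\frac{m_\nu(x)}{m_\pi(x)}\right)^2 \int \left(1 - \frac{\nu^x(\theta)}{\pi^x(\theta)}\right)^2 \pi^x(\theta)\, d\theta,$$

because

$$\int \frac{d^2y_\epsilon}{d\epsilon^2}\Big|_{\epsilon=0}\, \pi^x(\theta)\, d\theta = \frac{2\, m_\nu(x)\,(m_\nu(x) - m_\pi(x))}{m_\pi^2(x)} \int \left(1 - \frac{\nu^x(\theta)}{\pi^x(\theta)}\right)\pi^x(\theta)\, d\theta = 0.$$

Further noting that

$$E_{\pi^x}\left(\frac{\nu(\theta)}{\pi(\theta)}\right) = \int \frac{\nu(\theta)}{\pi(\theta)}\, \frac{f(x|\theta)\pi(\theta)}{m_\pi(x)}\, d\theta \qquad (3.33)$$
$$= \frac{\int \nu(\theta) f(x|\theta)\, d\theta}{m_\pi(x)} \qquad (3.34)$$
$$= \frac{m_\nu(x)}{m_\pi(x)}, \qquad (3.35)$$

and that $\nu^x(\theta)/\pi^x(\theta) = \{\nu(\theta)/\pi(\theta)\}\{m_\pi(x)/m_\nu(x)\}$, we get

$$\frac{d^2}{d\epsilon^2} d_\phi(\pi^x, \nu_\epsilon^x)\Big|_{\epsilon=0} = \phi''(1) \int \left(\frac{\nu(\theta)}{\pi(\theta)} - \frac{m_\nu(x)}{m_\pi(x)}\right)^2 \pi^x(\theta)\, d\theta = \phi''(1)\, V_{\pi^x}\left(\frac{\nu(\theta)}{\pi(\theta)}\right),$$

which concludes the proof. □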
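
The limit in (3.31) can be checked numerically. The Python sketch below works on a finite grid of $\theta$ values (so that integrals become sums) and compares the variance ratio on the right-hand side of (3.31) with the divergence ratio evaluated at a small $\epsilon$, using $\phi(y) = y \log y$ and the convention of the proof, $d_\phi(\pi, \nu_\epsilon) = \sum \phi(\nu_\epsilon/\pi)\,\pi$. The grid, the two priors, the observation `x = 1.5`, and the $N(\theta, 1)$ likelihood are all assumptions made for illustration.

```python
import math


def normalize(w):
    total = sum(w)
    return [wi / total for wi in w]


def variance(h, prob):
    """Variance of h(theta) under the pmf prob."""
    mean = sum(hi * pi for hi, pi in zip(h, prob))
    return sum(hi * hi * pi for hi, pi in zip(h, prob)) - mean ** 2


def d_phi(base, other, phi):
    """Convention of the proof: sum of phi(other/base) weighted by base."""
    return sum(phi(o / b) * b for o, b in zip(other, base))


def posterior(prior, lik):
    return normalize([p * l for p, l in zip(prior, lik)])


theta = [-2.0, -1.0, 0.0, 1.0, 2.0]                    # hypothetical grid
x = 1.5                                                # hypothetical observation
lik = [math.exp(-0.5 * (x - t) ** 2) for t in theta]   # N(theta, 1) likelihood
pi = normalize([1, 2, 3, 2, 1])
nu = normalize([3, 1, 1, 1, 3])
ratio = [n / p for n, p in zip(nu, pi)]

# Right-hand side of (3.31)
s_formula = variance(ratio, posterior(pi, lik)) / variance(ratio, pi)

# Left-hand side, approximated at a small eps
phi = lambda y: y * math.log(y)
eps = 1e-4
nu_eps = [(1 - eps) * p + eps * n for p, n in zip(pi, nu)]
s_numeric = (d_phi(posterior(pi, lik), posterior(nu_eps, lik), phi)
             / d_phi(pi, nu_eps, phi))
```

The two quantities agree up to an error of order $\epsilon$, and $\phi''(1)$ cancels from the ratio, so the same check works for the other $\phi$ functions of Table 3.3.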
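We consider next geometric perturbations. The following theorem then
follows from Theorem 3.2 of Dey and Birmiwal (1994).


Theorem 3.17. Suppose that
(i) $\int \left(\log \frac{\nu(\theta)}{\pi(\theta)}\right)^2 \pi(\theta)\, d\theta < \infty$, and
(ii) $\int \left(\log \frac{\nu(\theta)}{\pi(\theta)}\right)^2 \left(\frac{\nu(\theta)}{\pi(\theta)}\right)^\epsilon \pi(\theta)\, d\theta < \infty$ for some $\epsilon > 0$. Then

$$s(\pi,\nu;x) = \frac{V_{\pi^x}\left(\log \frac{\nu(\theta)}{\pi(\theta)}\right)}{V_\pi\left(\log \frac{\nu(\theta)}{\pi(\theta)}\right)}. \qquad (3.36)$$

Proof. As before in Theorem 3.16, it is enough to establish that

$$\frac{d^2}{d\epsilon^2} d_\phi(\pi^x, \nu_\epsilon^x)\Big|_{\epsilon=0} = \phi''(1)\, V_{\pi^x}\left(\log \frac{\nu(\theta)}{\pi(\theta)}\right), \qquad (3.37)$$

proving the desired result. □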


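Applications of these results are similar to those of a related simpler
approach as shown below. The other approach to local sensitivity analysis is
simply to look at variation of the curvature of the $\phi$-divergence as discussed in
Dey and Birmiwal (1994) and Delampady and Dey (1994). This also turns out to
be much easier, as shown below. Consider the class $\Gamma$ of $\epsilon$-contamination
priors,

$$\Gamma = \{\pi : \pi = (1 - \epsilon)\pi_0 + \epsilon q,\ q \in Q\}.$$

Then the curvature $C(q)$ defined by $C(q) = \frac{d^2}{d\epsilon^2} \int \phi\left(\frac{\pi(\theta|x)}{\pi_0(\theta|x)}\right)\pi_0(\theta|x)\, d\theta\, \Big|_{\epsilon=0}$, under
general regularity conditions, has the form $C(q) = \phi''(1)\, Var_{\pi_0(\cdot|x)}\left(\frac{q(\theta)}{\pi_0(\theta)}\right)$, as seen
previously. Similarly, if we consider the class

$$\Gamma = \{\pi : \pi = c(\epsilon)\, \pi_0^{1-\epsilon} q^{\epsilon},\ q \in Q\},$$

then we have that $C(q) = \phi''(1)\, Var_{\pi_0(\cdot|x)}\left(\log \frac{q(\theta)}{\pi_0(\theta)}\right)$. Variation of these quantities
over many parametric and nonparametric classes can be easily computed.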


The following example is from Dey and Birmiwal (1994). 





Example 3.18. Consider $X|\theta \sim N_p(\theta, I)$ and the class $\Gamma$, where under $\pi_0$,
$\theta \sim N_p(\mu_0, \Sigma_0)$ and $Q = \{q : \theta|q \sim N_p(\mu_0, k\Sigma_0),\ k_1 \le k \le k_2\}$, with $k_1 < 1 < k_2$.
Then the posterior distribution of $\theta$ given $x$ under $\pi_0$ is

$$N_p\left(\Sigma_0(I + \Sigma_0)^{-1}x + (I + \Sigma_0)^{-1}\mu_0,\ \Sigma_0(I + \Sigma_0)^{-1}\right),$$

and hence

$$C(q) = \phi''(1)\, \frac{(k - 1)^2}{4k^2}\left\{2\,\mathrm{trace}(I + \Sigma_0)^{-2} + 4(x - \mu_0)'\Sigma_0(I + \Sigma_0)^{-3}(x - \mu_0)\right\}.$$

It can then be shown that $C(q)$ attains its minimum at $k = 1$ and maximum
at $k_1$ or $k_2$. The extent of robustness will of course depend on the data $x$,
smaller values of $C(q)$ indicating robustness.


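Let $Q_{US}$ be the class of unimodal spherically symmetric densities $q$ such
that $\max_\theta q(\theta) \le h$ for some specified $h > 0$. Consider

$$\Gamma = \{\pi : \pi = (1 - \epsilon)\pi_0 + \epsilon q,\ q \in Q_{US}\}.$$

(See Sivaganesan (1989) for details on this class.) Then, under certain reasonable
conditions (see Delampady and Dey (1994)),

$$\sup_{q \in Q_{US}} C(q) = \phi''(1) \sup_{q \in Q_{US}} V_{\pi_0(\cdot|x)}\left(\frac{q(\theta)}{\pi_0(\theta)}\right)$$
$$= \phi''(1) \sup_{V(r) \ge 1/h} \left\{\frac{1}{m_{\pi_0}(x)\, V(r)^2} \int_{S(r)} \frac{f(x|\theta)}{\pi_0(\theta)}\, d\theta - \left(\frac{1}{m_{\pi_0}(x)\, V(r)} \int_{S(r)} f(x|\theta)\, d\theta\right)^2\right\}, \qquad (3.38)$$

where $S(r)$ is a sphere of radius $r$ centered at 0 and $V(r)$ denotes its volume.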
The following example illustrates the use of this result. 


Example 3.19. Let $X|\theta \sim N(\theta, 1)$, and under $\pi_0$, $\theta \sim N(0, \tau^2)$, $\tau^2 > 1$. Then
$m_{\pi_0}$ is the density of $N(0, \tau^2 + 1)$. Upper bounds for $C(q)$ (denoted by $C^*$)
calculated using (3.38) are listed in Table 3.4 for selected values of $\tau$ and $x$.
The extremely large values of $C^*$ corresponding to $\tau = 1.1$ and $x = 3, 4$
indicate that these data are not compatible with $\pi_0$. However, the same data
are compatible with $\pi_0$ if $\tau$ has a larger value, say 2.0. Some kind of calibration,
however, is needed to establish precisely what magnitudes of curvature can
be considered extreme.


Table 3.4. Bounds on Curvature for Different Values of $\tau$ and $x$

$\tau$ | $x$ | $C^*$
1.1    | 2   | 909.3
       | 3   | $2.08225 \times 10^{8}$
       | 4   | $1.06395 \times 10^{16}$
1.5    | 2   | 1.0918
       | 3   | 3.7237
       | 4   | 454.3244
2.0    | 3   | 1.1186
       | 4   | 7.0946
Before we conclude this discussion, it should be mentioned that there is a
large amount of literature on gamma minimax estimation, which is a frequentist
approach to Bayesian robustness. The idea here is to look for the minimax
estimator, with the class of priors considered (for minimaxity) being the one
identified for Bayesian robustness considerations. Let us take a very brief look.

Recall that for any decision rule $\delta$, its frequentist risk function is given
by $R(\theta, \delta) = E L(\theta, \delta(X))$, where $L$ is the loss function and the expectation
is with respect to the distribution of $X|\theta$. If $\pi$ is any prior distribution on
$\theta$, the Bayes risk of $\delta$ with respect to $\pi$ is $r(\pi, \delta) = E^\pi R(\theta, \delta)$. The decision
rule $\delta_\pi$ that minimizes the Bayes risk $r(\pi, \delta)$ is the Bayes rule with respect
to $\pi$. Under the minimax principle, the optimal decision rule $\delta^M$ (minimax
rule) is that which minimizes the maximum of the frequentist risk $R(\theta, \delta)$.
Equivalently, $\delta^M$ minimizes the maximum of the Bayes risk $r(\pi, \delta)$ over the
class of all priors $\pi$. Under the gamma minimax principle, if $\pi$ is constrained
to lie in a class $\Gamma$, the optimal rule $\delta^\Gamma$ (gamma-minimax rule) minimizes
$\sup_{\pi \in \Gamma} r(\pi, \delta)$.
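
To make the gamma-minimax criterion concrete, here is a minimal Python sketch under strong simplifying assumptions that are not part of the text: a single observation $X|\theta \sim N(\theta, 1)$, squared error loss, decision rules restricted to the linear form $\delta_c(x) = cx$, and a hypothetical class $\Gamma = \{N(0, \tau^2) : 1 \le \tau^2 \le 10\}$. In this setting $R(\theta, \delta_c) = c^2 + (1 - c)^2\theta^2$, so the Bayes risk under $N(0, \tau^2)$ is $c^2 + (1 - c)^2\tau^2$.

```python
def bayes_risk(c, tau2):
    """Bayes risk of delta_c(x) = c*x for X|theta ~ N(theta, 1),
    theta ~ N(0, tau2), under squared error loss."""
    return c ** 2 + (1.0 - c) ** 2 * tau2


def worst_case_risk(c, tau2_grid):
    """Supremum of the Bayes risk over the (discretized) class of priors."""
    return max(bayes_risk(c, t) for t in tau2_grid)


# Hypothetical class Gamma: tau^2 ranging over [1, 10]
tau2_grid = [1.0 + 9.0 * i / 90 for i in range(91)]
# Candidate linear rules, indexed by the shrinkage factor c in [0, 1]
c_grid = [i / 1000 for i in range(1001)]

c_star = min(c_grid, key=lambda c: worst_case_risk(c, tau2_grid))
```

Within this restricted family the worst case always occurs at the largest prior variance, so the grid search returns a value close to $10/11$, the Bayes shrinkage factor for $\tau^2 = 10$.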

Even though there are many attractive results in this topic, we will not be 
discussing them. Extensive discussion can be found in Berger (1984, 1985a), 
and further material in Ickstadt (1992) and Vidakovic (2000). 


3.9 Inherently Robust Procedures 


It is natural to look for priors and the resulting Bayesian procedures that are 
inherently robust. Adopting this approach will eliminate the need for checking 
robustness at the end by building robustness into the analysis at the beginning 
itself. Further, practitioners can demand “default” Bayesian procedures with 
built-in robustness that do not require specific sensitivity analyses requiring 
sophisticated tools. 

Accumulated evidence indicates that priors with flatter tails than those of 
the likelihood tend to be more robust than easier choices such as conjugate 
priors. Literature here includes Dawid (1973), Box and Tiao (1973), Berger 
(1984, 1985a), O’Hagan (1988, 1990), Angers and Berger (1991), Fan and 
Berger (1992), and Geweke (1999). The following example from Berger (1994) 
illustrates some of these ideas. 


Example 3.20. Let $X_1, \ldots, X_n$ be a random sample from a measurement error
model, so that $X_i = \theta + \epsilon_i$, $i = 1, \ldots, n$, where the $\epsilon_i$ are the measurement errors.
The $\epsilon_i$'s can then be reasonably assumed to be i.i.d. having a symmetric unimodal
distribution with mean 0 and unknown variance $\sigma^2$. The location parameter $\theta$
is of inferential interest, with the prior information that its distribution is
symmetric about 0 and has quartiles of $\pm 1$, whereas $\sigma^2$ is a nuisance parameter
with little prior information.

The simple "standard" analysis would assume that $X_i|\theta, \sigma^2$ are i.i.d.
$N(\theta, \sigma^2)$ and $\pi(\theta, \sigma^2) \propto \frac{1}{\sigma^2}\pi_1(\theta)$, where under $\pi_1$ the prior distribution of
$\theta$ is $N(0, 2.19)$. (This may be contrasted with the Jeffreys analysis discussed in
Section 2.7.2.) This conjugate prior analysis suffers from nonrobustness as
mentioned previously.

Instead, assume that $X_i|\theta, \sigma^2$ are i.i.d. $t_4(\theta, \sigma^2)$, and likewise assume that
under $\pi_1$, the prior distribution of $\theta$ is Cauchy(0, 1). This analysis would
achieve certain robustness lacking in the previous approach. Any outliers in the
data will be adequately handled by the Student's $t$ model, and further, if the
prior and the data are in conflict, the prior information will be mostly ignored.
There are certain computational issues to be addressed here. The "standard"
analysis is very easy, whereas the robust approach is computationally intensive.
However, the MCMC techniques that will be discussed later in the context of
hierarchical Bayesian analysis can handle these problems.
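
The effect of prior tails under prior-data conflict can be seen in a much simpler setting than the one in the example. The Python sketch below assumes a single observation $X|\theta \sim N(\theta, 1)$ with known variance (a simplification, not the $t_4$ model above) and compares the posterior mean under the $N(0, 2.19)$ prior with that under the Cauchy(0, 1) prior; both priors have quartiles at $\pm 1$. The observation `x = 10.0`, the integration limits, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def posterior_mean(x, prior_density, lo=-40.0, hi=60.0, steps=20000):
    """Posterior mean of theta for one observation X|theta ~ N(theta, 1),
    computed by midpoint-rule integration on [lo, hi]."""
    h = (hi - lo) / steps
    num = den = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * h
        w = math.exp(-0.5 * (x - t) ** 2) * prior_density(t)
        num += t * w
        den += w
    return num / den


# Normal prior variance that puts the quartiles at +-1 (about 2.198)
normal_var = (1.0 / NormalDist().inv_cdf(0.75)) ** 2
normal_prior = NormalDist(0.0, math.sqrt(normal_var)).pdf
# Cauchy(0, 1) density; its quartiles are exactly +-1
cauchy_prior = lambda t: 1.0 / (math.pi * (1.0 + t * t))

x = 10.0   # hypothetical observation far from the prior center
mean_normal = posterior_mean(x, normal_prior)
mean_cauchy = posterior_mean(x, cauchy_prior)
```

Under the normal prior the posterior mean is pulled substantially toward 0, whereas under the Cauchy prior it stays close to the observation, which is the behavior described in the example.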


O’Hagan (1990) and Angers (2000) discuss some of these issues formally 
using concepts that they call credence and p-credence that compare the tail 
behavior of the posterior distribution with that of heavy tailed distributions 
such as Student’s t and exponential power density. 

Further discussion of robust priors and robust procedures will be deferred 
to Chapters 4 and 5 where we shall consider default and reference priors that 
are improper priors. 


3.10 Loss Robustness 


Given the same decision problem, it is possible that different decision makers 
have different assessments for the consequences of their actions and hence 
may have different loss functions. In such a situation, it may be necessary to 
evaluate the sensitivity of Bayesian procedures to the choice of loss. 


Example 3.21. Suppose $X$ is Poisson($\theta$) and $\theta$ has the prior distribution of
exponential with mean 1. Suppose $x = 0$ is observed. Then the posterior
distribution of $\theta$ is exponential with mean 1/2. Therefore, the Bayes estimator
of $\theta$ under squared error loss is 1/2, which is the posterior mean, whereas
the Bayes estimator under absolute error loss is 0.3465, the posterior median.
These are clearly different, and this difference may have some significant
impact depending on the use to which the estimator is being put.
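
The two estimates in this example follow from closed-form properties of the exponential distribution, as the short Python sketch below shows. The variable names are illustrative only.

```python
import math

# Prior: theta ~ exponential with mean 1; likelihood: X|theta ~ Poisson(theta).
# The posterior given x is Gamma(shape = x + 1, rate = 2); for x = 0 it is
# exponential with rate 2.
rate = 2.0
posterior_mean = 1.0 / rate                # Bayes estimate, squared error loss
posterior_median = math.log(2.0) / rate    # Bayes estimate, absolute error loss
```

The median equals $(\log 2)/2 \approx 0.3466$, which agrees with the value 0.3465 quoted in the example up to rounding.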


It is possible to provide a Bayesian approach to the study of loss robustness 
exactly as we have done for the prior distribution. In particular, if a class of 
loss functions is available, range of posterior expected losses can be computed 
and examined as was done in Dey et al. (1998) and Dey and Micheas (2000). 
There are also other approaches, such as that of computing non-dominated 
alternatives, which is outlined in Martin et al. (1998). 
3.11 Model Robustness


The model for the observables is the most important component of statistical 
inference, and hence imprecisions in the specification of the model that can 
lead to inaccurate inferences must be viewed with great concern. There has 
been a lot of work in classical statistics in this regard, but most of that only 
addresses the problem of influence of outliers with respect to a specified target 
model. In principle, Bayesian approach to model robustness need not be any 
different from that for prior robustness or loss robustness. However, the prob- 
lem gets complicated because the mapping of likelihood function to posterior 
density is not ratio-linear, and hence different techniques need to be employed 
to assess the sensitivity. If only a finite set of models need to be considered, 
the problem is a simple one and one simply needs to check the inferences 
obtained under the different models for the given data. It needs to be kept 
in mind that, even in this case, different models may be based on different 
parameters with different interpretations, and hence the specification of prior 
distributions may be a complicated problem. The following example which 
illustrates some of the possibilities is similar to Example 1 of Shyamalkumar 
(2000). (See Pericchi and Pérez (1994) and Berger et al. (2000) also.) 


Example 3.22. Suppose the quantity of inferential interest is $\theta$, the median
of the model. Model uncertainty is represented by considering the set of two
models,

$$M = \{N(\theta, 1),\ \mathrm{Cauchy}(\theta, 0.675)\},$$

where 0.675 above is the scale parameter of the Cauchy distribution. In other
words, $X$ is either $N(\theta, 1)$ or Cauchy($\theta$, 0.675). Since $\theta$ is the median of the
model in either case, it is not difficult to specify its prior distribution. Suppose
the prior $\pi$ lies in the class $\Gamma$ of $N(0, \tau^2)$, $1 \le \tau^2 \le 10$. The ranges of posterior
means are as shown in Table 3.5.

As can be seen, model robustness is also dependent on the observed $x$,
just like prior or loss robustness. In many situations, this robustness will be
absent, and there is no solution other than providing further input on model
refinements.
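
Ranges of this kind can be approximated by direct numerical integration. The Python sketch below does so for a single observation under each of the two likelihoods, scanning a grid of $\tau^2$ values in $[1, 10]$. The observation `x = 2.0`, the grid, and the integration limits are hypothetical, and the code is not claimed to reproduce the entries of Table 3.5.

```python
import math


def normal_lik(x, t):
    return math.exp(-0.5 * (x - t) ** 2)


def cauchy_lik(x, t, scale=0.675):
    return 1.0 / (1.0 + ((x - t) / scale) ** 2)


def posterior_mean(x, lik, tau2, lo=-30.0, hi=30.0, steps=6000):
    """Posterior mean of theta under a N(0, tau2) prior, by the midpoint rule."""
    h = (hi - lo) / steps
    num = den = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * h
        w = lik(x, t) * math.exp(-0.5 * t * t / tau2)
        num += t * w
        den += w
    return num / den


def mean_range(x, lik, tau2_values):
    """Smallest and largest posterior mean over the grid of prior variances."""
    means = [posterior_mean(x, lik, t2) for t2 in tau2_values]
    return min(means), max(means)


tau2_values = [1.0 + 0.5 * i for i in range(19)]   # 1.0, 1.5, ..., 10.0
x = 2.0                                            # hypothetical observation
normal_range = mean_range(x, normal_lik, tau2_values)
cauchy_range = mean_range(x, cauchy_lik, tau2_values)
```

For the normal likelihood the posterior mean is $x\tau^2/(1 + \tau^2)$, so the range is $(x/2,\ 10x/11)$ exactly; for the Cauchy likelihood the grid search gives only an approximation to the infimum and supremum.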


Model robustness does have a long history even though the material is not 
very extensive. Box and Tiao (1962) have considered this problem in a simple 
setup. Lavine (1991) and Fernandez et al. (2001) have used a nonparamet- 
ric class of models, and Bayarri and Berger (1998b) have studied robustness 
in selection models. These can be considered global robustness approaches 
as compared with the approach of local robustness adopted by Cuevas and 
Sanz (1988), Sivaganesan (1993), and Dey et al. (1996). Extrema of func- 
tional derivative of the posterior quantities are studied by these authors. This 
is similar to the local robustness approach for prior distributions. Some of the 
frequentist approaches such as Huber (1964, 1981) are also somewhat relevant. 
Table 3.5. Range of Posterior Means for Different Models

Likelihood | inf $E(\theta|x)$ | sup $E(\theta|x)$ | inf $E(\theta|x)$ | sup $E(\theta|x)$ | inf $E(\theta|x)$ | sup $E(\theta|x)$


3.12 Exercises 


1. (St. Petersburg paradox). Suppose you are invited to play the following
   game. A fair coin is tossed repeatedly until it comes up heads. The
   reward will be $2^n$ (in some unit of currency) if it takes $n$ tosses until a
   head first appears. How much would you be willing to pay to play this
   game? Show that the expected monetary return is $\infty$, but few would be
   willing to pay very much to play the game.

2. Consider a lottery where it costs \$1 to buy a ticket. If you win the lottery
   you get \$1000. If the probability of winning the lottery is 0.0001, decide
   what you should do under each of the following utility functions, $u(x)$, $x$
   being the monetary gain:
   (a) $u(x) = x$; (b) $u(x) = \log_e(.3 + x)$; (c) $u(x) = \exp(1 + x/100)$.

3. A mango grower owns three orchards. Orchard I yields 50% of his total
   produce, II provides 30% and III provides the rest. Even though they are
   all of a single variety, 2% of the mangoes from I, 1% each from II and III
   are excessively sour tasting.
   (a) What is the probability that a mango randomly selected from the total
   produce is excessively sour?
   (b) What is the probability that a randomly selected mango that is found
   to be excessively sour came from orchard II?
   (c) Consider a box of 100 mangoes all of which came from a single orchard,
   but we don't know which one. A mango is selected randomly from this
   box and is found to be sour. What is the probability that a second mango
   randomly selected from the remaining 99 is also sour?

4. Show that the Student's $t$ density can be expressed as a scale mixture of
   normal densities.


5. Refer to Example 3.1. Suppose that the prior for $\theta$ has median 1, and
   upper quartile 2. Consider the priors,
   (i) $\theta \sim$ exponential, (ii) $\log(\theta) \sim$ normal and (iii) $\log(\theta) \sim$ Cauchy.
   (a) Determine the hyperparameters of the three priors.
   (b) Plot the posterior mean $E^\pi(\theta|x)$ for the three priors when $x$ lies in
   the range $0 \le x \le 50$.

6. Let $X_1, X_2, \ldots, X_n$ be a random sample from Poisson($\theta$), where estimation
   of $\theta$ is of interest.
   (a) Derive the range of posterior means when the prior lies in the class of
   Gamma distributions with prior mean 4.
   (b) Compute the range of posterior means if the prior density is known
   only to be a continuous non-increasing function.


7. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\theta, \sigma^2)$, $\sigma^2$ known. Consider the following
   class of conjugate priors for $\theta$: $\Gamma = \{N(0, \tau^2),\ \tau^2 > 0\}$.
   (a) Find the range of posterior means.
   (b) Find the range of posterior variances.
   (c) Suppose $\bar{x} > 0$. Plot the range of 95% HPD credible intervals.
   (d) Suppose $\sigma^2 = 10$ and $n = 10$. Further, suppose that an $\bar{x}$ of large
   magnitude is observed. If, now, a $N(0, 1)$ prior is assumed (in which case
   the prior mean is far from the sample mean but the prior variance and the
   sampling variance of $\bar{x}$ are equal), show that the posterior mean and also the
   credible interval will show substantial shrinkage. Comment on this phenomenon
   of the prior not allowing the data to have more influence when the data and
   prior are in conflict. What would happen if instead a Cauchy(0, 1) prior
   were to be used?

8. Let $X|\theta \sim N(\theta, 1)$ and let $\Gamma_{SU}$ denote the class of unimodal priors which
   are symmetric about 0.
   (a) Plot $\{\inf_{\pi \in \Gamma_{SU}} m_\pi(x),\ \sup_{\pi \in \Gamma_{SU}} m_\pi(x)\}$ for $0 \le x \le 10$.
   (b) Plot $\{\inf_{\pi \in \Gamma_{SU}} E^\pi(\theta|x),\ \sup_{\pi \in \Gamma_{SU}} E^\pi(\theta|x)\}$ for $0 \le x \le 10$.

9. Let $X_1, X_2, \ldots, X_n$ be i.i.d. with density

   $$f(x|\theta) = \exp(-(x - \theta)), \quad x > \theta,$$

   where $-\infty < \theta < \infty$. Consider the class of unimodal prior distributions
   on $\theta$ which are symmetric about 0. Compute the range of posterior means
   and that of the posterior probability that $\theta > 0$, for $n = 5$ and $x =$
   (0.1828, 0.0288, 0.2355, 1.6038, 0.4584).

10. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\theta, \sigma^2)$, where $\theta$ needs to be estimated,
    but $\sigma^2$, which is also unknown, is a nuisance parameter. Let $\bar{x}$ denote the
    sample mean and $s_{n-1}^2 = \sum_i (x_i - \bar{x})^2/(n - 1)$, the sample variance.
    (a) Show that under the prior $\pi(\theta, \sigma^2) \propto (\sigma^2)^{-1}$, the posterior distribution
    of $\theta$ is given by

    $$\frac{\sqrt{n}(\theta - \bar{x})}{s_{n-1}} \sim t_{n-1}.$$

    (b) Using (a), justify the standard confidence interval

    $$\bar{x} \pm t_{n-1}(\alpha/2)\, s_{n-1}/\sqrt{n}$$

    as an HPD Bayesian credible interval of coefficient $100(1 - \alpha)\%$, where
    $t_{n-1}(\alpha/2)$ is the $t_{n-1}$ quantile of order $(1 - \alpha/2)$.
    (c) If instead, $\theta|\sigma^2 \sim N(\mu, c\sigma^2)$ and $\pi(\sigma^2) \propto (\sigma^2)^{-1}$, for specified $\mu$ and
    $c$, what is the HPD Bayesian credible interval of coefficient $100(1 - \alpha)\%$?
    (d) In (c) above, suppose $c = 5$ and $\mu$ is not specified, but is known to
    lie in the interval $0 \le \mu \le 3$, $n = 9$, $\bar{x} = 0$ and $s_{n-1} = 1$. Investigate the
    robustness of the credible interval given in (b) by computing the range of
    its posterior probability.
    (e) Consider independent priors: $\theta \sim N(\mu, \tau^2)$, $\pi(\sigma^2) \propto (\sigma^2)^{-1}$, where
    $0 \le \mu \le 3$ and $5 \le \tau^2 \le 10$. Conduct a robustness study as in (d) now.
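11. Let

    $$\rho_\lambda(x) = \begin{cases} \dfrac{x^\lambda - 1}{\lambda} & \text{if } \lambda > 0; \\ \lim_{\lambda \to 0} \dfrac{x^\lambda - 1}{\lambda} = \log(x) & \text{if } \lambda = 0, \end{cases}$$

    and consider the following family of probability densities introduced by
    Albert et al. (1991):

    $$\pi(\theta|\mu, \phi, c, \lambda) = k(c, \lambda)\sqrt{\phi}\, \exp\left\{-\frac{c}{2}\, \rho_\lambda\left(1 + \frac{\phi(\theta - \mu)^2}{c - 1}\right)\right\}, \qquad (3.39)$$

    where $k(c, \lambda)$ is the normalizing constant, $-\infty < \mu < \infty$, $\phi > 0$, $c > 1$,
    $\lambda \ge 0$.
    (a) Show that $\pi$ is unimodal symmetric about $\mu$.
    (b) Show that the family of densities defined by (3.39) contains many
    location-scale families.
    (c) Show that normal densities are included in this family.
    (d) Show that Student's $t$ is a special case of this density when $\lambda = 0$.
    (e) Show that (3.39) behaves like the double exponential when $\lambda = 1/2$.
    (f) For $0 \le \lambda \le 1$, show that the density in (3.39) is a scale mixture of
    normal densities.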

12. Suppose $X|\theta \sim N(\theta, \sigma^2)$, with $\theta$ being the parameter of interest. Explain
    how the family of prior densities given by (3.39) can be used to study the
    robustness of the posterior inferences in this case. In particular, explain
    what values of $\lambda$ are expected to provide robustness over a large range of
    values of $X = x$.

13. Refer to the definition of belief function, Equation (3.20). Show that Bel
    is a probability measure iff Bel(.) = Pl(.).

14. Show that any probability measure is also a belief function.

15. Refer to Example 3.14. Prove (3.25) and (3.26).

16. Refer to Example 3.14 again. Let $X|\theta \sim N(\theta, 1)$ and let $\pi_0$ denote $N(0, \tau^2)$
    with $\tau^2 = 2$. Take $\epsilon = 0.2$ and suppose $x = 3.5$ is observed.
    (a) Construct the 95% HPD credible interval for $\theta$ under $\pi_0$.
    (b) Compute (3.25) and (3.26) for the interval in (a) now, and check
    whether robustness is present when the $\epsilon$-contamination class of priors is
    considered.

17. (Dey and Birmiwal (1994)) Let $X = (X_1, \ldots, X_k)$ have a multinomial
    distribution with probability mass function

    $$P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{\prod_{i=1}^k x_i!} \prod_{i=1}^k p_i^{x_i}, \quad \text{with } n = \sum_{i=1}^k x_i \text{ and } 0 < p_i < 1,\ \sum_{i=1}^k p_i = 1.$$

    Suppose under $\pi_0$, $p$ has the Dirichlet distribution $D(\alpha)$ with density

    $$\pi_0(p) = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k p_i^{\alpha_i - 1}, \quad \text{with } \alpha_0 = \sum_{i=1}^k \alpha_i, \text{ where } \alpha_i > 0.$$

    Now consider the $\epsilon$-contamination class of priors with $Q = \{D(s\alpha),\ s > 1\}$.
    Derive the extreme values of the curvature $C(q)$.


4


Large Sample Methods 


In order to make Bayesian inference about the parameter $\theta$, given a model
$f(x|\theta)$, one needs to choose an appropriate prior distribution for $\theta$. Given the
data x, the prior distribution is used to find the posterior distribution and var- 
ious posterior summary measures, depending on the problem. Thus exact or 
approximate computation of the posterior is a major problem for a Bayesian. 
Under certain regularity conditions, the posterior can be approximated by 
a normal distribution with the maximum likelihood estimate (MLE) as the 
mean and inverse of the observed Fisher information matrix as the dispersion 
matrix, if the sample size is large. If more accuracy is needed, one may use the 
Kass-Kadane-Tierney or Edgeworth type refinements. Alternatively, one may 
sample from the approximate posterior and take resort to importance sam- 
pling. Posterior normality has an important philosophical implication, which 
we discuss below. 

How the posterior inference is influenced by a particular prior depends on 
the relative magnitude of the amount of information in the data, which for 
i.i.d. observations may be measured by the sample size $n$ or $nI(\theta)$ or observed
Fisher information $\hat{I}_n$ (defined in Section 4.1.2), and the amount of information
in the prior, which is discussed in Chapter 5. As the sample size grows,
the influence of the prior distribution diminishes. Thus for large samples, a 
precise mathematical specification of prior distribution is not necessary. In 
most cases of low-dimensional parameter space, the situation is like this. A 
Bayesian would refer to it as washing away of the prior by the data. There are 
several mathematical results embodying this phenomenon of which posterior 
normality is the most well-known. 

This chapter deals with posterior normality and some of its refinements. 
We begin with a discussion on limiting behavior of posterior distribution in 
Section 4.1. A sketch of proof of asymptotic normality of posterior is given 
in this section. A more accurate posterior approximation based on Laplace’s 
asymptotic method and its refinements by Tierney, Kass, and Kadane are 
the subjects of Section 4.3. A refinement of posterior normality is discussed 
in Section 4.2 where an asymptotic expansion of the posterior distribution
with a leading normal term is outlined. Throughout this chapter, we consider
only the case with a finite dimensional parameter. Also, $\theta$ is assumed to be a
"continuous" parameter with a prior density function.

We apply these results for determination of sample size in Section 4.2.1. 


4.1 Limit of Posterior Distribution 


In this section, we discuss the limiting behavior of posterior distributions as 
the sample size n — oo. The limiting results can be used as approximations if 
n is sufficiently large. They may be used also as a form of frequentist validation 
of Bayesian analysis. We begin with a discussion of posterior consistency in 
Section 4.1.1. Asymptotic normality of posterior distribution is the subject of 
Section 4.1.2. 


4.1.1 Consistency of Posterior Distribution 


Suppose a data sequence is generated as i.i.d. random variables with density
$f(x|\theta_0)$. Would a Bayesian analyzing these data with his prior $\pi(\theta)$ be able
to learn about $\theta_0$? Our prior knowledge about $\theta$ is updated into the posterior
as we learn more from the data. Ideally, the updated knowledge about
$\theta$, represented by its posterior distribution, should become more and more
concentrated near $\theta_0$ as the sample size increases. This asymptotic property
is known as consistency of the posterior distribution at $\theta_0$. Let $X_1, \ldots, X_n$
be the observations at the $n$th stage, abbreviated as $\mathbf{X}_n$, having a density
$f(\mathbf{x}_n | \theta)$, $\theta \in \Theta \subset \mathcal{R}^p$. Let $\pi(\theta)$ be a prior density, $\pi(\theta | \mathbf{X}_n)$ the posterior
density as defined in (2.1), and $\Pi(\cdot | \mathbf{X}_n)$ the corresponding posterior
distribution.


Definition. The sequence of posterior distributions $\Pi(\cdot | \mathbf{X}_n)$ is said to be
consistent at some $\theta_0 \in \Theta$, if for every neighborhood $U$ of $\theta_0$, $\Pi(U | \mathbf{X}_n) \to 1$
as $n \to \infty$ with probability one with respect to the distribution under $\theta_0$.


The idea goes back to Laplace, who had shown the following. If $X_1, \ldots, X_n$
are i.i.d. Bernoulli with $P_\theta(X_1 = 1) = \theta$ and $\pi(\theta)$ is a prior density that is
continuous and positive on (0, 1), then the posterior is consistent at all $\theta_0$ in
(0, 1). von Mises (1957) calls this the second fundamental law of large numbers,
the first being Bernoulli's weak law of large numbers. The need for posterior
consistency has been stressed by Freedman (1963, 1965) and Diaconis and
Freedman (1986).

From the definition of convergence in distribution, it follows that consistency
of $\Pi(\cdot | \mathbf{X}_n)$ at $\theta_0$ is equivalent to the fact that $\Pi(\cdot | \mathbf{X}_n)$ converges
to the distribution degenerate at $\theta_0$ with probability one under $\theta_0$.

Consistency of posterior distribution holds in the general case with a finite
dimensional parameter under mild conditions. For general results see, for
example, Ghosh and Ramamoorthi (2003). For a real parameter $\theta$, consistency
at $\theta_0$ can be proved by showing $E(\theta | \mathbf{X}_n) \to \theta_0$ and $Var(\theta | \mathbf{X}_n) \to 0$ with
probability one under $\theta_0$. This follows from an application of Chebyshev's
inequality.


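Example 4.1. Let $X_1, X_2, \ldots, X_n$ be i.i.d. Bernoulli observations with $P_\theta(X_1 = 1) = \theta$.
Consider a Beta($\alpha$, $\beta$) prior density for $\theta$. The posterior density of $\theta$
given $X_1, X_2, \ldots, X_n$ is then a Beta($\sum_{i=1}^n X_i + \alpha$, $n - \sum_{i=1}^n X_i + \beta$) density
with

$$E(\theta | X_1, \ldots, X_n) = \frac{\sum_{i=1}^n X_i + \alpha}{n + \alpha + \beta},$$

$$Var(\theta | X_1, \ldots, X_n) = \frac{(\sum_{i=1}^n X_i + \alpha)(n - \sum_{i=1}^n X_i + \beta)}{(\alpha + \beta + n)^2(\alpha + \beta + n + 1)}.$$

As $\frac{1}{n}\sum_{i=1}^n X_i \to \theta_0$ with $P_{\theta_0}$-probability 1 by the law of large numbers, it
follows that $E(\theta | X_1, \ldots, X_n) \to \theta_0$ and $Var(\theta | X_1, \ldots, X_n) \to 0$ with
probability one under $\theta_0$. Therefore, in view of the result mentioned in the
previous paragraph, the posterior distribution of $\theta$ is consistent.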
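
The convergence in this example is easy to see numerically. The Python sketch below evaluates the posterior mean and variance for increasing $n$, with the number of successes fixed deterministically at a proportion 0.3 of the trials instead of being simulated; the value 0.3, the Beta(2, 2) prior, and the function name are assumptions made for illustration.

```python
def beta_posterior_summary(successes, n, a=1.0, b=1.0):
    """Posterior mean and variance of theta under a Beta(a, b) prior
    after observing `successes` ones in n Bernoulli trials."""
    a_post = successes + a
    b_post = n - successes + b
    total = a_post + b_post
    mean = a_post / total
    var = a_post * b_post / (total ** 2 * (total + 1.0))
    return mean, var


theta0 = 0.3   # hypothetical true value; successes are set to round(theta0 * n)
summaries = {
    n: beta_posterior_summary(round(theta0 * n), n, a=2.0, b=2.0)
    for n in (10, 100, 1000, 10000)
}
```

As $n$ grows, the posterior mean approaches 0.3 and the posterior variance shrinks at rate $1/n$, which is the behavior used with Chebyshev's inequality above.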


An important result related to consistency is the robustness of the posterior
inference with respect to choice of prior. Let $X_1, \ldots, X_n$ be i.i.d. observations.
Let $\pi_1$ and $\pi_2$ be two prior densities which are positive and continuous
at $\theta_0$, an interior point of $\Theta$, such that the corresponding posterior
distributions $\Pi_1(\cdot | \mathbf{X}_n)$ and $\Pi_2(\cdot | \mathbf{X}_n)$ are both consistent at $\theta_0$. Then with
probability one under $\theta_0$

$$\int |\pi_1(\theta | \mathbf{X}_n) - \pi_2(\theta | \mathbf{X}_n)|\, d\theta \to 0$$

or equivalently,

$$\sup_A |\Pi_1(A | \mathbf{X}_n) - \Pi_2(A | \mathbf{X}_n)| \to 0.$$

Thus, two different choices of the prior distribution lead to approximately the
same posterior distribution. A proof of this result is available in Ghosh et al.
(1994) and Ghosh and Ramamoorthi (2003).
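
The following Python sketch illustrates this merging of posteriors in the Bernoulli model, where both posteriors are Beta distributions and the $L_1$ distance can be approximated on a grid. The two priors, Beta(1, 1) and Beta(5, 2), the success proportion 0.3, and the function names are hypothetical choices for this example.

```python
import math


def log_beta_pdf(t, a, b):
    """Log density of Beta(a, b) at t in (0, 1)."""
    return ((a - 1.0) * math.log(t) + (b - 1.0) * math.log(1.0 - t)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))


def l1_distance(successes, n, prior1, prior2, steps=20000):
    """Midpoint-rule approximation to the L1 distance between the two
    Beta posteriors obtained from Beta priors prior1 and prior2."""
    (a1, b1), (a2, b2) = prior1, prior2
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        p1 = math.exp(log_beta_pdf(t, a1 + successes, b1 + n - successes))
        p2 = math.exp(log_beta_pdf(t, a2 + successes, b2 + n - successes))
        total += abs(p1 - p2) * h
    return total


# Successes fixed at 30% of the trials for each sample size
distances = [
    l1_distance(3 * n // 10, n, (1.0, 1.0), (5.0, 2.0))
    for n in (10, 100, 1000, 10000)
]
```

The distances decrease steadily with $n$, although slowly (roughly like $1/\sqrt{n}$ in this setup), so a noticeable difference between the two posteriors can persist for moderate sample sizes.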


4.1.2 Asymptotic Normality of Posterior Distribution 


Large sample Bayesian methods are primarily based on normal approximation
to the posterior distribution of $\theta$. As the sample size $n$ increases, the posterior
distribution approaches normality under certain regularity conditions and
hence can be well approximated by an appropriate normal distribution if $n$ is
sufficiently large. When $n$ is large, the posterior distribution becomes highly
concentrated in a small neighborhood of the posterior mode. Suppose that the
notations are as in Section 4.1.1, and $\tilde{\theta}_n$ denotes the posterior mode. Under
suitable regularity conditions, a Taylor expansion of $\log \pi(\theta | \mathbf{X}_n)$ at $\tilde{\theta}_n$ gives

$$\log \pi(\theta | \mathbf{X}_n) = \log \pi(\tilde{\theta}_n | \mathbf{X}_n) + (\theta - \tilde{\theta}_n)' \frac{\partial}{\partial\theta} \log \pi(\theta | \mathbf{X}_n)\Big|_{\tilde{\theta}_n} - \frac{1}{2}(\theta - \tilde{\theta}_n)' \tilde{I}_n (\theta - \tilde{\theta}_n) + \cdots$$
$$\approx \log \pi(\tilde{\theta}_n | \mathbf{X}_n) - \frac{1}{2}(\theta - \tilde{\theta}_n)' \tilde{I}_n (\theta - \tilde{\theta}_n), \qquad (4.1)$$

where $\tilde{I}_n$ is a $p \times p$ matrix defined as

$$\tilde{I}_n = \left(-\frac{\partial^2}{\partial\theta_i \partial\theta_j} \log \pi(\theta | \mathbf{X}_n)\right)\Big|_{\tilde{\theta}_n},$$

and may be called the generalized observed Fisher information matrix. The term
involving the first derivative is zero as the derivative is zero at the mode $\tilde{\theta}_n$.
Also, under suitable conditions the terms involving third and higher order
derivatives can be shown to be asymptotically negligible as $\theta$ is essentially
close to $\tilde{\theta}_n$. Because the first term in (4.1) is free of $\theta$, $\pi(\theta | \mathbf{X}_n)$, as a function
of $\theta$, is approximately represented as a density proportional to

$$\exp\left[-\frac{1}{2}(\theta - \tilde{\theta}_n)' \tilde{I}_n (\theta - \tilde{\theta}_n)\right],$$

which is a $N_p(\tilde{\theta}_n, \tilde{I}_n^{-1})$ density (with $p$ being the dimension of $\theta$).

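As the posterior distribution becomes highly concentrated in a small
neighborhood of the posterior mode $\tilde{\theta}_n$ where the prior density $\pi(\theta)$ is nearly
constant, the posterior density $\pi(\theta | \mathbf{X}_n)$ is essentially proportional to the
likelihood $f(\mathbf{X}_n | \theta)$. Therefore, in the above heuristics, we can replace $\tilde{\theta}_n$ by
the maximum likelihood estimate (MLE) $\hat{\theta}_n$ and $\tilde{I}_n$ by the observed Fisher
information matrix

$$\hat{I}_n = \left(-\frac{\partial^2}{\partial\theta_i \partial\theta_j} \log f(\mathbf{X}_n | \theta)\right)\Big|_{\hat{\theta}_n},$$

so that the posterior distribution of $\theta$ is approximately $N_p(\hat{\theta}_n, \hat{I}_n^{-1})$.

The dispersion matrix of the approximating normal distribution may also
be taken to be the inverse of the expected Fisher information matrix $I(\theta)$
evaluated at $\hat{\theta}_n$, where $I(\theta)$ is a matrix defined as

$$I(\theta) = E_\theta\left(-\frac{\partial^2}{\partial\theta_i \partial\theta_j} \log f(\mathbf{X}_n | \theta)\right).$$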


Thus we have the following result. 


Result. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. observations, abbreviated as
$\mathbf{X}_n$, having a density $f(\mathbf{x}_n | \theta)$, $\theta \in \Theta \subset \mathcal{R}^p$. Let $\pi(\theta)$ be a prior density and
$\pi(\theta | \mathbf{X}_n)$ the posterior density as defined in (2.1). Let $\tilde{\theta}_n$ be the posterior
mode, $\hat{\theta}_n$ the MLE and $\tilde{I}_n$, $\hat{I}_n$ and $I(\theta)$ be the different forms of Fisher
information matrix defined above. Then under suitable regularity conditions,
for large $n$, the posterior distribution of $\theta$ can be approximated by any one of
the normal distributions $N_p(\tilde{\theta}_n, \tilde{I}_n^{-1})$ or $N_p(\hat{\theta}_n, \hat{I}_n^{-1})$ or $N_p(\hat{\theta}_n, I^{-1}(\hat{\theta}_n))$.

In particular, under suitable regularity conditions, the posterior distribution
of $\hat{I}_n^{1/2}(\theta - \hat{\theta}_n)$, given $\mathbf{X}_n$, converges to $N_p(0, I)$ with probability one
under the true model for the data, where $I$ denotes the identity matrix of
order $p$. This is comparable with the result from classical statistical theory
that the repeated sampling distribution of $\hat{I}_n^{1/2}(\theta - \hat{\theta}_n)$ given $\theta$ also converges
to $N_p(0, I)$.

For a comment on the accuracy of the different normal approximations 
stated in the above result and an example, see Berger (1985a, Sec. 4.7.8). 
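
As a rough check of the normal approximation $N_p(\hat{\theta}_n, \hat{I}_n^{-1})$ in the simplest case, the Python sketch below compares an exact posterior probability with its normal approximation for Bernoulli data under a uniform prior, where the posterior is Beta and the observed Fisher information at the MLE is $n/\{\hat{\theta}_n(1 - \hat{\theta}_n)\}$. The data (60 successes in 200 trials), the interval, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def beta_cdf(t, a, b, steps=20000):
    """Midpoint-rule approximation to the Beta(a, b) distribution function at t."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = t / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        total += math.exp(log_norm + (a - 1.0) * math.log(u)
                          + (b - 1.0) * math.log(1.0 - u)) * h
    return total


n, successes = 200, 60                           # hypothetical data
theta_hat = successes / n                        # MLE
obs_info = n / (theta_hat * (1.0 - theta_hat))   # observed Fisher information at the MLE
approx = NormalDist(theta_hat, 1.0 / math.sqrt(obs_info))

lo, hi = theta_hat - 0.05, theta_hat + 0.05
# Exact posterior under a uniform prior is Beta(successes + 1, n - successes + 1)
exact_prob = (beta_cdf(hi, successes + 1, n - successes + 1)
              - beta_cdf(lo, successes + 1, n - successes + 1))
approx_prob = approx.cdf(hi) - approx.cdf(lo)
```

With these numbers the two probabilities differ only in the third decimal place; the agreement degrades for small $n$ or for $\hat{\theta}_n$ near 0 or 1, where the posterior is visibly skewed.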

We formally state a theorem below giving a set of regularity conditions 
under which asymptotic normality of posterior distribution holds. 

Posterior normality, in some form, was first observed by Laplace in 1774 
and later by Bernstein (1917) and von Mises (1931). More recent contributors 
in this area include Le Cam (1953, 1958, 1986), Bickel and Yahav (1969), 
Walker (1969), Chao (1970), Borwanker et al. (1971), and Chen (1985). Ghosal 
(1997, 1999, 2000) considered cases where the number of parameters increases. 
A general approach that also works for nonregular problems is presented in 
Ghosh et al. (1994) and Ghosal et al. (1995). 

We present below a version of a theorem that appears in Ghosh and Ramamoorthi (2003). For simplicity, we consider the case with a real parameter $\theta$ and i.i.d. observations $X_1, \ldots, X_n$.

Let $X_1, X_2, \ldots, X_n$ be i.i.d. observations with a common distribution $P_\theta$ possessing a density $f(x \mid \theta)$ where $\theta \in \Theta$, an open subset of $\mathbb{R}$. We fix $\theta_0 \in \Theta$, which may be regarded as the “true value” of the parameter as the probability statements are all made under $\theta_0$. Let $l(\theta, x) = \log f(x \mid \theta)$, $L_n(\theta) = \sum_{i=1}^n l(\theta, X_i)$, the log-likelihood, and for a function $h$, let $h^{(i)}$ denote the $i$th derivative of $h$. We assume the following regularity conditions on the density $f(x \mid \theta)$.

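(A1) The set $\{x : f(x \mid \theta) > 0\}$ is the same for all $\theta \in \Theta$.

(A2) $l(\theta, x)$ is thrice differentiable with respect to $\theta$ in a neighborhood $(\theta_0 - \delta, \theta_0 + \delta)$ of $\theta_0$. The expectations $E_{\theta_0} l^{(1)}(\theta_0, X_1)$ and $E_{\theta_0} l^{(2)}(\theta_0, X_1)$ are both finite and

\[ \sup_{\theta \in (\theta_0 - \delta,\, \theta_0 + \delta)} \left| l^{(3)}(\theta, x) \right| \le M(x) \quad \text{and} \quad E_{\theta_0} M(X_1) < \infty. \tag{4.2} \]

(A3) Interchange of the order of integration with respect to $P_{\theta_0}$ and differentiation at $\theta_0$ is justified, so that

\[ E_{\theta_0} l^{(1)}(\theta_0, X_1) = 0, \qquad E_{\theta_0} l^{(2)}(\theta_0, X_1) = -E_{\theta_0}\left( l^{(1)}(\theta_0, X_1) \right)^2. \]

Also, the Fisher information number per unit observation $I(\theta_0) = E_{\theta_0}\left( l^{(1)}(\theta_0, X_1) \right)^2$ is positive.

(A4) For any $\delta > 0$, with $P_{\theta_0}$-probability one

\[ \sup_{|\theta - \theta_0| > \delta} \frac{1}{n}\left( L_n(\theta) - L_n(\theta_0) \right) < -\epsilon \]

for some $\epsilon > 0$ and for all sufficiently large $n$.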


Remark: Suppose there exists a strongly consistent sequence of estimators $\theta_n$ of $\theta$. This means for all $\theta_0 \in \Theta$, $\theta_n \to \theta_0$ with $P_{\theta_0}$-probability one. Then by the arguments given in Ghosh (1983), a strongly consistent solution $\hat\theta_n$ of the likelihood equation $L_n^{(1)}(\theta) = 0$ exists, i.e., there exists a sequence of statistics $\hat\theta_n$ such that with $P_{\theta_0}$-probability one $\hat\theta_n$ satisfies the likelihood equation for sufficiently large $n$ and $\hat\theta_n \to \theta_0$.


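Theorem 4.2. Suppose assumptions (A1)-(A4) hold and $\hat\theta_n$ is a strongly consistent solution of the likelihood equation. Then for any prior density $\pi(\theta)$ which is continuous and positive at $\theta_0$,

\[ \lim_{n \to \infty} \int_{\mathbb{R}} \left| \pi_n^*(t \mid X_1, \ldots, X_n) - \sqrt{\frac{I(\theta_0)}{2\pi}}\; e^{-\frac{1}{2} t^2 I(\theta_0)} \right| dt = 0 \tag{4.3} \]

with $P_{\theta_0}$-probability one, where $\pi_n^*(t \mid X_1, \ldots, X_n)$ is the posterior density of $t = \sqrt{n}(\theta - \hat\theta_n)$ given $X_1, \ldots, X_n$.
Also under the same assumptions, (4.3) holds with $I(\theta_0)$ replaced by $\hat I_n = -\frac{1}{n} L_n^{(2)}(\hat\theta_n)$.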


A sketch of proof. We present only a sketch; interested readers may obtain a detailed complete proof from it.

The proof consists of essentially two steps. It is first shown that the tails of the posterior distribution are negligible. Then in the remaining part, the log-likelihood function is expanded by Taylor’s theorem up to terms involving the third derivative. The linear term in the expansion vanishes, the quadratic term is proportional to the logarithm of a normal density, and the remainder term is negligible under assumption (4.2) on the third derivative.


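Because $\pi_n(\theta \mid X_1, \ldots, X_n) \propto \prod_{i=1}^n f(X_i \mid \theta)\,\pi(\theta)$, the posterior density of $t = \sqrt{n}(\theta - \hat\theta_n)$ can be written as

\[ \pi_n^*(t \mid X_1, \ldots, X_n) = C_n^{-1}\, \pi(\hat\theta_n + t/\sqrt{n}) \exp\left[ L_n(\hat\theta_n + t/\sqrt{n}) - L_n(\hat\theta_n) \right] \tag{4.4} \]

where $C_n = \int_{\mathbb{R}} \pi(\hat\theta_n + t/\sqrt{n}) \exp\left[ L_n(\hat\theta_n + t/\sqrt{n}) - L_n(\hat\theta_n) \right] dt$.

Most of the statements made below hold with $P_{\theta_0}$-probability one but we will omit the phrase “with $P_{\theta_0}$-probability one”.
Let

\[ g_n(t) = \pi(\hat\theta_n + t/\sqrt{n}) \exp\left[ L_n(\hat\theta_n + t/\sqrt{n}) - L_n(\hat\theta_n) \right] - \pi(\theta_0)\, e^{-\frac{1}{2} t^2 I(\theta_0)}. \]

We first note that in order to prove (4.3), it is enough to show

\[ \int_{\mathbb{R}} |g_n(t)|\, dt \to 0. \tag{4.5} \]

If (4.5) holds, $C_n \to \pi(\theta_0)\sqrt{2\pi / I(\theta_0)}$ and therefore, the integral in (4.3), which is dominated by

\[ C_n^{-1} \int_{\mathbb{R}} |g_n(t)|\, dt + \int_{\mathbb{R}} \left| C_n^{-1} \pi(\theta_0)\, e^{-\frac{1}{2} t^2 I(\theta_0)} - \sqrt{\frac{I(\theta_0)}{2\pi}}\; e^{-\frac{1}{2} t^2 I(\theta_0)} \right| dt, \]

also goes to zero.
To show (4.5), we break $\mathbb{R}$ into two regions $A_1 = \{t : |t| > \delta_0 \sqrt{n}\}$ and $A_2 = \{t : |t| < \delta_0 \sqrt{n}\}$ for some suitably chosen small positive number $\delta_0$ and show for $i = 1, 2$,

\[ \int_{A_i} |g_n(t)|\, dt \to 0. \tag{4.6} \]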


To show (4.6) for $i = 1$, we note that

\[ \int_{A_1} |g_n(t)|\, dt \le \int_{A_1} \pi(\hat\theta_n + t/\sqrt{n}) \exp\left[ L_n(\hat\theta_n + t/\sqrt{n}) - L_n(\hat\theta_n) \right] dt + \int_{A_1} \pi(\theta_0)\, e^{-\frac{1}{2} t^2 I(\theta_0)}\, dt. \]

It is easy to see that the second integral goes to zero. For the first integral, we note that by assumption (A4), for $t \in A_1$,

\[ \frac{1}{n}\left[ L_n(\hat\theta_n + t/\sqrt{n}) - L_n(\hat\theta_n) \right] < -\epsilon \]

for all sufficiently large $n$. It follows that (4.6) holds for $i = 1$.
To show (4.6) for $i = 2$, we use the dominated convergence theorem.

Expanding in Taylor series and noting that $L_n^{(1)}(\hat\theta_n) = 0$ we have for large $n$,

\[ L_n(\hat\theta_n + t/\sqrt{n}) - L_n(\hat\theta_n) = -\frac{1}{2} t^2 \hat I_n + R_n(t) \tag{4.7} \]

where $R_n(t) = \frac{1}{6}(t/\sqrt{n})^3 L_n^{(3)}(\theta_n')$ and $\theta_n'$ lies between $\hat\theta_n$ and $\hat\theta_n + t/\sqrt{n}$.
By assumption (A2), for each $t$, $R_n(t) \to 0$ and $\hat I_n \to I(\theta_0)$ and therefore, $g_n(t) \to 0$. For suitably chosen $\delta_0$, for any $t \in A_2$,

\[ |R_n(t)| \le \frac{1}{6}\, \delta_0\, t^2\, \frac{1}{n} \sum_{i=1}^n M(X_i) < \frac{1}{4} t^2 \hat I_n \]

for sufficiently large $n$ so that from (4.7),

\[ \exp\left[ L_n(\hat\theta_n + t/\sqrt{n}) - L_n(\hat\theta_n) \right] < e^{-\frac{1}{4} t^2 \hat I_n} < e^{-\frac{1}{8} t^2 I(\theta_0)}. \]

Therefore, for suitably chosen small $\delta_0$, $|g_n(t)|$ is dominated by an integrable function on the set $A_2$. Thus (4.6) holds for $i = 2$. This completes the proof of (4.3). The second part of the theorem follows as $\hat I_n \to I(\theta_0)$. $\Box$


Remark. We assume in the proof above that $\pi(\theta)$ is a proper probability density. However, Theorem 4.2 holds even if $\pi$ is improper, if there is an $n_0$ such that the posterior distribution of $\theta$ given $(x_1, x_2, \ldots, x_{n_0})$ is proper for a.e. $(x_1, x_2, \ldots, x_{n_0})$.


The following theorem states that in the regular case with a large sample, a Bayes estimate is approximately the same as the maximum likelihood estimate $\hat\theta_n$. If we consider the squared error loss, the Bayes estimate for $\theta$ is given by the posterior mean

\[ \theta_n^* = \int_{\Theta} \theta\, \pi_n(\theta \mid X_1, \ldots, X_n)\, d\theta. \]

Theorem 4.3. In addition to the assumptions of Theorem 4.2, assume that the prior density $\pi(\theta)$ has a finite expectation. Then $\sqrt{n}(\theta_n^* - \hat\theta_n) \to 0$ with $P_{\theta_0}$-probability one.


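Proof. Proceeding as in the proof of Theorem 4.2 and using the assumption of finite expectation for $\pi$, (4.3) can be strengthened to

\[ \int_{\mathbb{R}} |t| \left| \pi_n^*(t \mid X_1, \ldots, X_n) - \sqrt{\frac{I(\theta_0)}{2\pi}}\; e^{-\frac{1}{2} t^2 I(\theta_0)} \right| dt \to 0 \]

with $P_{\theta_0}$-probability one. This implies

\[ \int_{\mathbb{R}} t\, \pi_n^*(t \mid X_1, \ldots, X_n)\, dt \to \int_{\mathbb{R}} t \sqrt{\frac{I(\theta_0)}{2\pi}}\; e^{-\frac{1}{2} t^2 I(\theta_0)}\, dt = 0. \]

Now $\theta_n^* = E(\theta \mid X_1, \ldots, X_n) = E(\hat\theta_n + t/\sqrt{n} \mid X_1, \ldots, X_n)$ and therefore,

\[ \sqrt{n}(\theta_n^* - \hat\theta_n) = \int_{\mathbb{R}} t\, \pi_n^*(t \mid X_1, \ldots, X_n)\, dt \to 0. \qquad \Box \]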
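As a small numerical illustration of Theorem 4.3, the sketch below takes a Bernoulli model with a uniform prior (an assumption chosen here only because the posterior mean is then available in closed form as $(T+1)/(n+2)$) and evaluates $\sqrt{n}(\theta_n^* - \hat\theta_n)$ for growing $n$ with the proportion of successes held at 0.3. The function name `scaled_gap` and the chosen sample sizes are illustrative, not part of the text.

```python
import math

def scaled_gap(n, t):
    """sqrt(n) * (posterior mean - MLE) for t successes in n Bernoulli trials,
    assuming a uniform prior on (0, 1)."""
    mle = t / n
    post_mean = (t + 1) / (n + 2)  # mean of the Beta(t + 1, n - t + 1) posterior
    return math.sqrt(n) * (post_mean - mle)

# Hold the observed proportion at 0.3 and let n grow.
gaps = [scaled_gap(n, 3 * n // 10) for n in (100, 10_000, 1_000_000)]
for n, gap in zip((100, 10_000, 1_000_000), gaps):
    print(n, gap)
```

In this example the scaled gap equals $\sqrt{n}\,(1 - 2\hat\theta_n)/(n+2)$, so it shrinks at rate $n^{-1/2}$, consistent with the theorem.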


Theorems 4.2 and 4.3 and their variants can be used to make inference about $\theta$ for large samples. We have seen in Chapter 2 how our inference can be based on the posterior distribution. If the sample size is sufficiently large, for a wide variety of priors we can replace the posterior distribution by the approximating normal distribution having mean $\hat\theta_n$ and dispersion $\hat I_n^{-1}$ or $(n\hat I_n)^{-1}$, which do not depend on the prior. Theorem 4.3 tells us that in the problem of estimating a real parameter with squared error loss, the Bayes estimate is approximately the same as the MLE $\hat\theta_n$. Indeed, Theorem 4.3 can be extended to show that this is also true for a wide variety of loss functions. Also the moments and quantiles of the posterior distribution can be approximated by the corresponding measures of the approximating normal distribution. We consider an example at the end of Section 4.3 to illustrate the use of asymptotic posterior normality in the problems of interval estimation and testing.
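To see how close the normal approximation can be in a simple case, the following sketch compares an exact posterior probability with its normal approximation for Bernoulli data under a uniform prior. The model, the data ($T = 160$ successes in $n = 400$ trials), the interval, and the helper names are all assumptions made for illustration; the exact probability is obtained by plain trapezoidal integration of the posterior kernel rather than by any library routine.

```python
import math
from statistics import NormalDist

def exact_posterior_prob(t, n, lo, hi, grid=20001):
    """P(lo < theta < hi | data) under a uniform prior, by trapezoidal
    integration of the kernel theta^t (1 - theta)^(n - t). Requires 0 < t < n."""
    mode = t / n
    log_peak = t * math.log(mode) + (n - t) * math.log1p(-mode)

    def kernel(theta):
        log_k = t * math.log(theta) + (n - t) * math.log1p(-theta)
        return math.exp(log_k - log_peak)  # rescaled by the peak to avoid underflow

    def trapezoid(a, b):
        step = (b - a) / (grid - 1)
        inner = sum(kernel(a + i * step) for i in range(1, grid - 1))
        return step * (0.5 * kernel(a) + inner + 0.5 * kernel(b))

    eps = 1e-9  # stay strictly inside (0, 1)
    return trapezoid(lo, hi) / trapezoid(eps, 1.0 - eps)

def normal_approx_prob(t, n, lo, hi):
    """Same probability under N(mle, (n * I_hat)^(-1)), I_hat = 1 / (mle (1 - mle))."""
    mle = t / n
    sd = math.sqrt(mle * (1.0 - mle) / n)
    approx = NormalDist(mu=mle, sigma=sd)
    return approx.cdf(hi) - approx.cdf(lo)

exact = exact_posterior_prob(160, 400, 0.35, 0.45)
approx = normal_approx_prob(160, 400, 0.35, 0.45)
print(exact, approx)
```

The normal approximation here uses only the MLE and the observed information, so it does not involve the prior, in line with the remark above.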


4.2 Asymptotic Expansion of Posterior Distribution 


Consider the setup of Theorem 4.2. Let

\[ F_n(u) = \Pi_n\left( \{ \sqrt{n}\, \hat I_n^{1/2} (\theta - \hat\theta_n) < u \} \mid X_1, \ldots, X_n \right) \]

be the posterior distribution function of $\sqrt{n}\, \hat I_n^{1/2}(\theta - \hat\theta_n)$. Then under certain regularity assumptions, $F_n(u)$ is approximately equal to $\Phi(u)$, where $\Phi$ is the standard normal distribution function. Theorem 4.2 implies that under assumptions (A1)-(A4) on the density $f(x \mid \theta)$, for any prior density $\pi(\theta)$ which is continuous and positive at $\theta_0$,

\[ \lim_{n \to \infty} \sup_u |F_n(u) - \Phi(u)| = 0 \quad \text{a.s. } P_{\theta_0}. \tag{4.8} \]

Recall that this is proved essentially in two steps. It is first shown that the tails of the posterior distribution are negligible. Then in the remaining part, the log-likelihood function is expanded by Taylor’s theorem up to terms involving the third derivative. The linear term in the expansion vanishes, the quadratic term is proportional to the logarithm of a normal density, and the remainder term is negligible under assumption (4.2) on the third derivative. Suppose now that $l(\theta, x) = \log f(x \mid \theta)$ is $(k + 3)$ times continuously differentiable and $\pi(\theta)$ is $(k + 1)$ times continuously differentiable at $\theta_0$ with $\pi(\theta_0) > 0$. Then the subsequent higher order terms in the Taylor expansion provide a refinement of the posterior normality result stated in Theorem 4.2 or in (4.8) above. Under conditions similar to (4.2) for the derivatives of $l(\theta, x)$ of order $3, 4, \ldots, k + 3$, and some more conditions on $f(x \mid \theta)$, Johnson (1970) proved the following rigorous and precise version of a refinement due to Lindley (1961).





\[ \sup_u \left| F_n(u) - \Phi(u) - \phi(u) \sum_{j=1}^{k} \psi_j(u; X_1, \ldots, X_n)\, n^{-j/2} \right| \le M_k\, n^{-(k+1)/2} \tag{4.9} \]

eventually with $P_{\theta_0}$-probability one for some $M_k > 0$, depending on $k$, where $\phi(u)$ is the standard normal density and each $\psi_j(u; X_1, \ldots, X_n)$ is a polynomial in $u$ having coefficients bounded in $X_1, \ldots, X_n$.

Under the same assumptions one can obtain a similar result involving the $L_1$ distance between the posterior density and an approximation.

The case $k = 0$ corresponds to that considered in Section 4.1.2 as (4.9) becomes

\[ \sup_u |F_n(u) - \Phi(u)| \le M_0\, n^{-1/2}. \tag{4.10} \]

Another (uniform) version of the above result, as stated in Ghosh et al. (1982), is as follows. Let $\Theta_1$ be a bounded open interval whose closure $\bar\Theta_1$ is properly contained in $\Theta$ and the prior $\pi$ be positive on $\bar\Theta_1$. Then, for $r > 0$, (4.9) holds with $P_{\theta_0}$-probability $1 - O(n^{-r})$, uniformly in $\theta_0 \in \Theta_1$, under certain regularity conditions (depending on $r$) which are stronger than those of Johnson (1970).

For a formal argument showing how the terms in the asymptotic expansion given in (4.9) are calculated, see Johnson (1970) and Ghosh et al. (1982). For example, if we want to obtain an approximation of the posterior distribution up to an error of order $o(n^{-1})$, we take $k = 2$ and proceed as follows. This is taken from Ghosh (1994).


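Let $t = \sqrt{n}(\theta - \hat\theta_n)$ and $a_i = \frac{1}{n} \left. \frac{d^i L_n(\theta)}{d\theta^i} \right|_{\hat\theta_n}$, $i \ge 1$, so that $a_2 = -\hat I_n$. The posterior density of $t$ is given by (4.4) and by Taylor expansion

\[ \pi(\hat\theta_n + t/\sqrt{n}) = \pi(\hat\theta_n) \left[ 1 + n^{-1/2}\, t\, \frac{\pi'(\hat\theta_n)}{\pi(\hat\theta_n)} + \frac{1}{2} n^{-1} t^2\, \frac{\pi''(\hat\theta_n)}{\pi(\hat\theta_n)} \right] + o(n^{-1}) \]

and

\[ L_n(\hat\theta_n + t/\sqrt{n}) - L_n(\hat\theta_n) = \frac{1}{2} t^2 a_2 + \frac{1}{6} n^{-1/2} t^3 a_3 + \frac{1}{24} n^{-1} t^4 a_4 + o(n^{-1}). \]

Therefore,

\[ \pi(\hat\theta_n + t/\sqrt{n}) \exp\left[ L_n(\hat\theta_n + t/\sqrt{n}) - L_n(\hat\theta_n) \right] = \pi(\hat\theta_n) \exp[a_2 t^2 / 2] \left[ 1 + n^{-1/2} \alpha_1(t; X_1, \ldots, X_n) + n^{-1} \alpha_2(t; X_1, \ldots, X_n) \right] + o(n^{-1}), \]

where

\[ \alpha_1(t; X_1, \ldots, X_n) = \frac{1}{6} t^3 a_3 + t\, \frac{\pi'(\hat\theta_n)}{\pi(\hat\theta_n)}, \]

\[ \alpha_2(t; X_1, \ldots, X_n) = \frac{1}{24} t^4 a_4 + \frac{1}{72} t^6 a_3^2 + \frac{1}{2} t^2\, \frac{\pi''(\hat\theta_n)}{\pi(\hat\theta_n)} + \frac{1}{6} t^4 a_3\, \frac{\pi'(\hat\theta_n)}{\pi(\hat\theta_n)}. \]

The normalizer $C_n$ also has a similar expansion that can be obtained by integrating the above. The posterior density of $t$ is then expressed as

\[ \pi_n^*(t \mid X_1, \ldots, X_n) = (2\pi)^{-1/2}\, \hat I_n^{1/2}\, e^{-t^2 \hat I_n / 2} \left[ 1 + \sum_{j=1}^{2} n^{-j/2} \gamma_j(t; X_1, \ldots, X_n) \right] + o(n^{-1}), \]

where $\gamma_1(t; X_1, \ldots, X_n) = \frac{1}{6} t^3 a_3 + t\, \frac{\pi'(\hat\theta_n)}{\pi(\hat\theta_n)}$ and

\[ \gamma_2(t; X_1, \ldots, X_n) = \frac{1}{24} t^4 a_4 + \frac{1}{72} t^6 a_3^2 + \frac{1}{2} t^2\, \frac{\pi''(\hat\theta_n)}{\pi(\hat\theta_n)} + \frac{1}{6} t^4 a_3\, \frac{\pi'(\hat\theta_n)}{\pi(\hat\theta_n)} - \frac{a_4}{8 a_2^2} + \frac{15}{72 a_2^3}\, a_3^2 + \frac{1}{2 a_2}\, \frac{\pi''(\hat\theta_n)}{\pi(\hat\theta_n)} - \frac{1}{2 a_2^2}\, a_3\, \frac{\pi'(\hat\theta_n)}{\pi(\hat\theta_n)}. \]

Transforming to $s = \hat I_n^{1/2} t$, we get the expansion for the posterior density of $\sqrt{n}\, \hat I_n^{1/2}(\theta - \hat\theta_n)$, and integrating it from $-\infty$ to $u$, we get the terms in (4.9).
The above expansion for posterior density also gives an expansion for the posterior mean:

\[ E(\theta \mid X_1, \ldots, X_n) = \hat\theta_n + n^{-1} \left[ \frac{a_3}{2 a_2^2} - \frac{1}{a_2}\, \frac{\pi'(\hat\theta_n)}{\pi(\hat\theta_n)} \right] + o(n^{-3/2}). \]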


Similar expansions can also be obtained for other moments and quantiles. 
For more details and discussion see Johnson (1970), Ghosh et al. (1982), and 
Ghosh (1994). Ghosh et al. (1982) and Ghosh (1994) also obtain expansions 
of Bayes estimate and Bayes risk. These expansions are rather delicate in 
the sense that the terms in the expansion can tend to infinity, see, e.g., the 
discussion in Ghosh et al. (1982). 
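The leading correction in the posterior-mean expansion can be checked in a case where the exact answer is known. The sketch below assumes Bernoulli data with a uniform prior (so $\pi'/\pi = 0$ and the exact posterior mean is $(T+1)/(n+2)$), writes the $n^{-1}$ correction as $a_3/(2a_2^2) - \pi'(\hat\theta_n)/(a_2\,\pi(\hat\theta_n))$ with $a_i = n^{-1} L_n^{(i)}(\hat\theta_n)$, and compares the two. The function name and the data $n = 50$, $T = 15$ are illustrative assumptions.

```python
def posterior_mean_expansion(n, t):
    """MLE plus the n^{-1} correction term, for t successes in n Bernoulli
    trials under a uniform prior (an assumption of this sketch)."""
    mle = t / n
    a2 = -1.0 / (mle * (1.0 - mle))              # (1/n) * L_n''(mle)
    a3 = 2.0 / mle**2 - 2.0 / (1.0 - mle) ** 2   # (1/n) * L_n'''(mle)
    prior_score = 0.0                             # pi'(mle) / pi(mle) for a uniform prior
    return mle + (a3 / (2.0 * a2**2) - prior_score / a2) / n

n, t = 50, 15
approx_mean = posterior_mean_expansion(n, t)
exact_mean = (t + 1) / (n + 2)
print(approx_mean, exact_mean, t / n)
```

In this example the correction reduces to $(1 - 2\hat\theta_n)/n$, while the exact mean is $\hat\theta_n + (1 - 2\hat\theta_n)/(n+2)$, so the two differ only at order $n^{-2}$.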

The expansions agree with those obtained by Tierney and Kadane (1986) (see Section 4.3) up to $o(n^{-2})$. Although the Tierney-Kadane approximation is more convenient for numerical calculations, the expansions obtained in Ghosh et al. (1982) and Ghosh (1994) are more suitable for theoretical applications.

A Bayesian would want to prove an expansion like (4.9) under the marginal distribution of $X_1, \ldots, X_n$ derived from the joint distribution of the $X$’s and $\theta$. There are certain technical difficulties in proving this from (4.9). Such a result will hold if the prior $\pi(\theta)$ is supported on a bounded interval and behaves smoothly at the boundary points in the sense that $\pi(\theta)$ and $(d^i/d\theta^i)\pi(\theta)$, $i = 1, 2, \ldots, k$, are zero at the boundary points. A rather technical proof is given in Ghosh et al. (1982). See also in this context Bickel and Ghosh (1990). For the uniform prior on a bounded interval, there can be no asymptotic expansion of the integrated Bayes risk (with squared error loss) of the form $a_0 + \frac{a_1}{n} + \frac{a_2}{n^2} + o(n^{-2})$ (Ghosh et al. (1982)).


4.2.1 Determination of Sample Size in Testing 


In this subsection, we consider certain testing problems and find asymptotic 
approximations to the corresponding (minimum) Bayes risks. These approx- 
imations can be used to determine sample sizes required to achieve given 
bounds for Bayes risks. 

We first consider the case with a real parameter $\theta \in \Theta$, an open interval in $\mathbb{R}$, and the problem of testing

\[ H_0 : \theta \le \theta_0 \quad \text{versus} \quad H_1 : \theta > \theta_0 \]

for some specified value $\theta_0$. Let $X_1, \ldots, X_n$ be i.i.d. observations with a common density $f(x \mid \theta)$ involving the parameter $\theta$. Let $\pi(\theta)$ be a prior density over $\Theta$ and $\pi(\theta \mid \mathbf{x})$ be the corresponding posterior. Set

\[ R_0(\mathbf{x}) = P(\theta > \theta_0 \mid \mathbf{x}) = \int_{\theta > \theta_0} \pi(\theta \mid \mathbf{x})\, d\theta, \qquad R_1(\mathbf{x}) = 1 - R_0(\mathbf{x}). \]

As mentioned in Section 2.7.2, the Bayes rule for the usual 0-1 loss (see Section 2.5) is to choose $H_0$ if $R_0(\mathbf{X}) \le R_1(\mathbf{X})$ or equivalently $R_1(\mathbf{X}) \ge \frac{1}{2}$, and to choose $H_1$ otherwise. The (minimum) Bayes risk is then given by

\[ r(\pi) = \int_{\theta > \theta_0} P_\theta[R_1(\mathbf{X}) \ge 1/2]\, \pi(\theta)\, d\theta + \int_{\theta \le \theta_0} P_\theta[R_1(\mathbf{X}) < 1/2]\, \pi(\theta)\, d\theta. \tag{4.11} \]

By Theorem 2.7 an alternative expression for the Bayes risk is given by

\[ r(\pi) = E\left[ \min\{ R_0(\mathbf{X}), R_1(\mathbf{X}) \} \right] \tag{4.12} \]

where the expectation is with respect to the marginal distribution of $\mathbf{X}$.

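Suppose $|\theta - \theta_0| > \delta$ where $\delta$ is chosen suitably. For each such $\theta$, $\hat\theta_n$ is close to $\theta$ with large probability and hence $|\hat\theta_n - \theta_0| > \delta$. Intuitively, for such $\hat\theta_n$ it will be relatively easy to choose the correct hypothesis. This suggests most of the contribution to the right hand side of (4.11) comes from $\theta$ close to $\theta_0$, i.e., from $|\theta - \theta_0| < \delta$. A formal argument that we skip shows

\[ r(\pi) = \int_{\theta_0 < \theta < \theta_0 + \delta_n} P_\theta[R_1(\mathbf{X}) \ge 1/2]\, \pi(\theta)\, d\theta + \int_{\theta_0 - \delta_n < \theta \le \theta_0} P_\theta[R_1(\mathbf{X}) < 1/2]\, \pi(\theta)\, d\theta + o(n^{-1}), \tag{4.13} \]

if $\delta_n = c\sqrt{\log n}/\sqrt{n}$ with $c$ sufficiently large. You are invited to verify this for the $N(\theta, 1)$ model in Problem 7.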
An approximation to the first integral of (4.13) can be obtained as follows. 


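By the result on normal approximation to posterior stated in the paragraph following (4.10),

\[ R_1(\mathbf{X}) = P\left[ \sqrt{n}\, \hat I_n^{1/2}(\theta - \hat\theta_n) \le \sqrt{n}\, \hat I_n^{1/2}(\theta_0 - \hat\theta_n) \mid \mathbf{X} \right] \]

can be approximated by $\Phi\left[ \sqrt{n}\, \hat I_n^{1/2}(\theta_0 - \hat\theta_n) \right]$. Hence

\[ P_\theta[R_1(\mathbf{X}) \ge 1/2] \approx P_\theta\left[ \Phi\left( \sqrt{n}\, \hat I_n^{1/2}(\theta_0 - \hat\theta_n) \right) \ge 1/2 \right] = P_\theta\left[ \sqrt{n}\, \hat I_n^{1/2}(\hat\theta_n - \theta) \le -\sqrt{n}\, \hat I_n^{1/2}(\theta - \theta_0) \right] \approx \Phi\left[ -\sqrt{n}\, I^{1/2}(\theta)(\theta - \theta_0) \right]. \]

Indeed, using appropriate uniform versions of the results on asymptotic expansions of posterior distribution (as stated above) and sampling distribution of $\sqrt{n}\, \hat I_n^{1/2}(\hat\theta_n - \theta)$ given $\theta$ (see, e.g., Ghosh (1994)), one obtains

\[ P_\theta[R_1(\mathbf{X}) \ge 1/2] = \Phi\left[ -\sqrt{n}\, I^{1/2}(\theta)(\theta - \theta_0) \right] + o(n^{-1/2}) \]

uniformly in $\theta$ belonging to bounded intervals contained in $\Theta$. Thus

\[ \int_{\theta_0 < \theta < \theta_0 + \delta_n} P_\theta[R_1(\mathbf{X}) \ge 1/2]\, \pi(\theta)\, d\theta = \int_{\theta_0 < \theta < \theta_0 + \delta_n} \Phi\left[ -\sqrt{n}\, I^{1/2}(\theta)(\theta - \theta_0) \right] \pi(\theta)\, d\theta + o(n^{-1/2}). \]

With similar approximation for the second integral of (4.13), we have

\[ r(\pi) = \int_{\theta_0 < \theta < \theta_0 + \delta_n} \Phi\left[ -\sqrt{n}\, I^{1/2}(\theta)(\theta - \theta_0) \right] \pi(\theta)\, d\theta + \int_{\theta_0 - \delta_n < \theta \le \theta_0} \Phi\left[ \sqrt{n}\, I^{1/2}(\theta)(\theta - \theta_0) \right] \pi(\theta)\, d\theta + o(n^{-1/2}) \]

\[ = \frac{1}{\sqrt{n}} \int_{0 < t < c\sqrt{\log n}} \Phi\left[ -t\, I^{1/2}(\theta_0 + t/\sqrt{n}) \right] \pi(\theta_0 + t/\sqrt{n})\, dt + \frac{1}{\sqrt{n}} \int_{-c\sqrt{\log n} < t \le 0} \Phi\left[ t\, I^{1/2}(\theta_0 + t/\sqrt{n}) \right] \pi(\theta_0 + t/\sqrt{n})\, dt + o(n^{-1/2}). \]

If we assume $\pi(\theta)$ and $I(\theta)$ have bounded derivatives in some neighborhood of $\theta_0$, the above reduces to

\[ r(\pi) = \frac{\pi(\theta_0)}{\sqrt{n}} \int_{t > 0} \Phi\left( -t\, I^{1/2}(\theta_0) \right) dt + \frac{\pi(\theta_0)}{\sqrt{n}} \int_{t < 0} \Phi\left( t\, I^{1/2}(\theta_0) \right) dt + o(n^{-1/2}) = \frac{2\pi(\theta_0)\, C}{\sqrt{n I(\theta_0)}} + o(n^{-1/2}), \tag{4.14} \]

where $C = \int_0^\infty [1 - \Phi(u)]\, du \approx 0.3989423$.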

From (4.14) it follows that if one wants to have Bayes risk at most equal to some specified $r_0$, then the required sample size $n_0$ with which one can achieve this (approximately) is given by

\[ n_0 = \frac{4 C^2 (\pi(\theta_0))^2}{r_0^2\, I(\theta_0)}. \tag{4.15} \]
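Formulas (4.14) and (4.15) translate directly into a few lines of code. The sketch below uses the fact that $C = \int_0^\infty [1 - \Phi(u)]\,du = 1/\sqrt{2\pi}$; the function names and the illustrative inputs ($\pi(\theta_0) = 1$, $I(\theta_0) = 4$, $r_0 = 0.04$) are assumptions chosen for the demonstration, and rounding $n_0$ up to an integer is a convenience of this sketch.

```python
import math

# C = integral of 1 - Phi(u) over (0, infinity) = 1 / sqrt(2 * pi), about 0.3989423
C = 1.0 / math.sqrt(2.0 * math.pi)

def approx_bayes_risk(n, prior_at_theta0, fisher_at_theta0):
    """Leading term of (4.14): 2 * pi(theta_0) * C / sqrt(n * I(theta_0))."""
    return 2.0 * prior_at_theta0 * C / math.sqrt(n * fisher_at_theta0)

def required_sample_size(r0, prior_at_theta0, fisher_at_theta0):
    """Smallest integer n at or above the value given by (4.15)."""
    n0 = 4.0 * C**2 * prior_at_theta0**2 / (r0**2 * fisher_at_theta0)
    return math.ceil(n0)

# Illustrative inputs only: prior density 1 and Fisher information 4 at theta_0.
n0 = required_sample_size(0.04, 1.0, 4.0)
print(n0, approx_bayes_risk(n0, 1.0, 4.0))
```

Because (4.14) is only a leading-order approximation, the sample size obtained this way is best treated as a planning value to be checked against an exact or simulated risk when that is feasible.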


In the same way we can handle, say, a two-parameter problem with parameter $\theta = (\theta_1, \theta_2)$. Suppose $\theta_1$ and $\theta_2$ are comparable and the quantity of interest is $\eta = \theta_1 - \theta_2$.

The problem is to test

\[ H_0 : \eta \le \eta_0 \quad \text{versus} \quad H_1 : \eta > \eta_0 \]

for some specified $\eta_0$. Let $\pi(\theta)$ be the joint prior density of $\theta_1, \theta_2$ and $p(\eta)$ be the marginal prior density of $\eta$. Let $\hat I_n$ be the observed Fisher information matrix as defined in the first part of Subsection 4.1.2. Then a normal approximation to the posterior distribution of $\theta$ is $N(\hat\theta_n, \hat I_n^{-1})$, vide Subsection 4.1.2. This implies that a normal approximation to the posterior of $\eta$ is given by $N(\hat\theta_{1n} - \hat\theta_{2n}, v_n)$ with

\[ v_n = \hat I_n^{11} + \hat I_n^{22} - 2 \hat I_n^{12}, \]

where $\hat I_n^{ij}$ denotes the $(i, j)$th element of $\hat I_n^{-1}$. Note that $(n v_n)^{-1/2} \to b(\theta) = [I^{11}(\theta) + I^{22}(\theta) - 2 I^{12}(\theta)]^{-1/2}$ under $\theta$, where $I^{ij}(\theta)$ denotes the $(i, j)$th element of $I^{-1}(\theta)$, $I(\theta)$ being the expected Fisher information matrix per unit observation.

Let $\pi^*(\beta)$ and $\pi^*(\eta \mid \beta)$ be respectively the marginal prior density of $\beta = \theta_1 + \theta_2$ and the conditional prior density of $\eta$ given $\beta$, and let $a(\eta, \beta)$ be $b(\theta)$ expressed in terms of $\eta$ and $\beta$. Then by arguments similar to those used above, an approximation to the Bayes risk for this problem is

\[ r(\pi) \approx \int \left\{ \frac{\pi^*(\eta_0 \mid \beta)}{\sqrt{n}} \int_{-\infty}^{\infty} \left[ 1 - \Phi\left( a(\eta_0, \beta)\, |t| \right) \right] dt \right\} \pi^*(\beta)\, d\beta = \frac{2C}{\sqrt{n}} \int \frac{\pi^*(\eta_0 \mid \beta)}{a(\eta_0, \beta)}\, \pi^*(\beta)\, d\beta \]

where $C$ is as in (4.14).


Whether one uses simulation or asymptotic approximation is a matter of taste. In any case, each method can confirm the accuracy of the other. The advantage of asymptotics is that we get an overview quickly. In specific cases, simulation may be a more efficient alternative, and asymptotics can be used to confirm the calculation.


Example 4.4. Let the observations $X_1, \ldots, X_n$ be i.i.d. $B(1, \theta)$, $0 < \theta < 1$, and suppose we want to test $H_0 : \theta \le 1/2$ versus $H_1 : \theta > 1/2$.
If we consider the uniform prior $\pi(\theta) = 1$ on $(0, 1)$, we have

\[ R_0(\mathbf{X}) = R_0(T) = \frac{\Gamma(n + 2)}{\Gamma(T + 1)\,\Gamma(n - T + 1)} \int_{1/2}^{1} \theta^{T} (1 - \theta)^{n - T}\, d\theta \]

which is a function of $T = \sum_{i=1}^n X_i$, and the marginal distribution of $T$ is uniform over $\{0, 1, \ldots, n\}$. Then from (4.12) the Bayes risk is given by

\[ r(\pi) = \frac{1}{n + 1} \sum_{t=0}^{n} \min\{ R_0(t),\, 1 - R_0(t) \}. \]

Here $I(\theta) = [\theta(1 - \theta)]^{-1}$ and the approximation suggested in (4.14) is

\[ r^*(\pi) = \frac{1}{\sqrt{n}} \int_0^\infty [1 - \Phi(u)]\, du. \]

Table 4.1 gives the exact values of Bayes risk $r(\pi)$ and its approximation $r^*(\pi)$ for different values of $n$. If one wants to have Bayes risk at most equal to $r_0 = 0.04$, from the approximate formula (4.15), the required sample size $n$ is at least 100 while the exact expression for $r(\pi)$ yields $n \ge 99$.
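The exact and approximate Bayes risks of Example 4.4 can be computed side by side. The sketch below relies on a standard identity that is not stated in the text: for integer $t$, the Beta$(t+1, n-t+1)$ posterior satisfies $P(\theta > 1/2 \mid T = t) = P(B \le t)$ with $B \sim \text{Binomial}(n+1, 1/2)$, which avoids numerical integration. Function names are illustrative.

```python
import math

def exact_bayes_risk(n):
    """(n + 1)^(-1) * sum over t of min{R_0(t), 1 - R_0(t)} for theta_0 = 1/2
    and a uniform prior, with R_0(t) = P(theta > 1/2 | T = t)."""
    total_weight = 2 ** (n + 1)
    cumulative = 0
    risk_sum = 0.0
    for t in range(n + 1):
        cumulative += math.comb(n + 1, t)   # running count for P(B <= t)
        r0 = cumulative / total_weight      # equals P(Beta(t+1, n-t+1) > 1/2)
        risk_sum += min(r0, 1.0 - r0)
    return risk_sum / (n + 1)

def approx_bayes_risk(n):
    """r*(pi) = C / sqrt(n) with C = 1 / sqrt(2 * pi), from (4.14) with
    pi(1/2) = 1 and I(1/2) = 4."""
    return 1.0 / math.sqrt(2.0 * math.pi * n)

for n in (30, 50, 100, 200):
    print(n, exact_bayes_risk(n), approx_bayes_risk(n))
```

Running the loop over a range of $n$ gives a quick way to see how fast the two columns approach each other.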


The above calculations are relevant in the planning stage, when there are 
no data. If we have a sample of size n and want to control the posterior Bayes 
risk by drawing m additional observations, we can follow a similar procedure 
replacing the prior by the posterior from the first stage of data. Ideally, the 
first-stage sample would be a pilot sample of relatively small size, and the 
bulk of the data would come from the second stage. In this case, we may even 
allow an improper noninformative prior for one-sided alternatives.

Table 4.1. The Exact Values of Bayes Risk $r(\pi)$ and Its Approximation $r^*(\pi)$ for Example 4.4.

n: 30 40 50 60 70 80 90 100 150 200 250


4.3 Laplace Approximation 


Bayesian analysis requires evaluation of integrals of the form 


\[ \int g(\theta)\, f(\mathbf{x} \mid \theta)\, \pi(\theta)\, d\theta \]

where $f(\mathbf{x} \mid \theta)$ is the likelihood function, $\pi(\theta)$ is the prior density, and $g(\theta)$ is some function of $\theta$. For example, with $g(\theta) = 1$ we have the integrated likelihood required for calculation of Bayes factor in testing or model selection. Various other characteristics of posterior and predictive distribution may also be expressed in terms of such integrals. Laplace’s method (see Laplace (1774)) is a technique for approximating integrals when the integrand has a sharp maximum.


4.3.1 Laplace’s Method 


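Let us consider an integral of the form

\[ I = \int_{-\infty}^{\infty} q(\theta)\, e^{n h(\theta)}\, d\theta \]

where $q$ and $h$ are smooth functions of $\theta$ with $h$ having a unique maximum at $\hat\theta$. In applications, $n h(\theta)$ may be the log-likelihood function or the logarithm of the unnormalized posterior density $f(\mathbf{x} \mid \theta)\pi(\theta)$, and $\hat\theta$ may be the MLE or posterior mode. The idea is that if $h$ has a unique sharp maximum at $\hat\theta$, then most of the contribution to the integral $I$ comes from the integral over a small neighborhood $(\hat\theta - \delta, \hat\theta + \delta)$ of $\hat\theta$. We study the behavior of $I$ as $n \to \infty$. As $n \to \infty$, we have

\[ I \sim I_1 = \int_{\hat\theta - \delta}^{\hat\theta + \delta} q(\theta)\, e^{n h(\theta)}\, d\theta. \]

Here $I \sim I_1$ means $I / I_1 \to 1$. Laplace’s method involves Taylor series expansion of $q$ and $h$ about $\hat\theta$, which gives

\[ I \sim \int_{\hat\theta - \delta}^{\hat\theta + \delta} \left[ q(\hat\theta) + (\theta - \hat\theta) q'(\hat\theta) + \frac{1}{2} (\theta - \hat\theta)^2 q''(\hat\theta) + \text{smaller terms} \right] \exp\left[ n h(\hat\theta) + n h'(\hat\theta)(\theta - \hat\theta) + \frac{n}{2} h''(\hat\theta)(\theta - \hat\theta)^2 + \text{smaller terms} \right] d\theta \]

\[ \sim e^{n h(\hat\theta)}\, q(\hat\theta) \int_{\hat\theta - \delta}^{\hat\theta + \delta} \left[ 1 + (\theta - \hat\theta)\, \frac{q'(\hat\theta)}{q(\hat\theta)} + \frac{1}{2} (\theta - \hat\theta)^2\, \frac{q''(\hat\theta)}{q(\hat\theta)} \right] \exp\left[ \frac{n}{2} h''(\hat\theta)(\theta - \hat\theta)^2 \right] d\theta. \]

Assuming that $c = -h''(\hat\theta)$ is positive and using a change of variable $t = \sqrt{nc}\,(\theta - \hat\theta)$, we have

\[ I \sim e^{n h(\hat\theta)}\, q(\hat\theta)\, \frac{1}{\sqrt{nc}} \int_{-\delta\sqrt{nc}}^{\delta\sqrt{nc}} \left[ 1 + \frac{t}{\sqrt{nc}}\, \frac{q'(\hat\theta)}{q(\hat\theta)} + \frac{t^2}{2nc}\, \frac{q''(\hat\theta)}{q(\hat\theta)} \right] e^{-t^2/2}\, dt \]

\[ \sim e^{n h(\hat\theta)}\, \frac{\sqrt{2\pi}}{\sqrt{nc}}\, q(\hat\theta) \left[ 1 + \frac{q''(\hat\theta)}{2nc\, q(\hat\theta)} \right] = e^{n h(\hat\theta)}\, \frac{\sqrt{2\pi}}{\sqrt{nc}}\, q(\hat\theta) \left[ 1 + O(n^{-1}) \right]. \]

In general, for the case with a $p$-dimensional parameter $\theta$,

\[ I = e^{n h(\hat\theta)}\, (2\pi)^{p/2}\, n^{-p/2}\, \det(\Delta_h(\hat\theta))^{-1/2}\, q(\hat\theta)\, (1 + O(n^{-1})) \tag{4.16} \]

where $\Delta_h(\theta)$ denotes the Hessian of $-h$, i.e.,

\[ \Delta_h(\theta) = \left( -\frac{\partial^2}{\partial\theta_i\, \partial\theta_j} h(\theta) \right)_{p \times p}. \]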

Example 4.5. (Stirling’s approximation to $n!$) Note that $n!$ can be written as a gamma integral

\[ n! = \Gamma(n + 1) = \int_0^\infty e^{-x} x^n\, dx = \int_0^\infty e^{n(\log x - x/n)}\, dx. \]

One can use the Laplace method described above to approximate $n!$ as (Problem 9)

\[ n! \approx n^{n + \frac{1}{2}}\, e^{-n} \sqrt{2\pi}. \]
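A quick numerical check of Stirling's approximation shows how the relative error behaves. The sketch below simply compares $n!$ with $n^{n+1/2} e^{-n}\sqrt{2\pi}$ for a few values of $n$ chosen for illustration; the function name `stirling` is an assumption of this sketch.

```python
import math

def stirling(n):
    """Laplace-method approximation n^(n + 1/2) * exp(-n) * sqrt(2 * pi)."""
    return n ** (n + 0.5) * math.exp(-n) * math.sqrt(2.0 * math.pi)

# Ratio of the exact factorial to the approximation.
ratios = [math.factorial(n) / stirling(n) for n in (5, 20, 100)]
for n, ratio in zip((5, 20, 100), ratios):
    print(n, ratio)
```

The ratios exceed one and approach it as $n$ grows, consistent with a relative error of order $n^{-1}$ as in the $[1 + O(n^{-1})]$ factor of the one-dimensional Laplace formula.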


The Bayesian Information Criterion (BIC) 


Consider a model with likelihood $f(\mathbf{x} \mid \theta)$ and prior $\pi(\theta)$. Equation (4.16), with $q = \pi$ and $n h(\theta)$ equal to the log-likelihood, yields an approximation to the integrated likelihood that can be used to find an approximation to the Bayes factor defined in (2.11). Schwarz (1978) proposed a criterion, known as the BIC, based on (4.16) ignoring the terms that stay bounded as the sample size $n \to \infty$. The criterion given by

\[ \text{BIC} = \log f(\mathbf{x} \mid \hat\theta) - (p/2) \log n \]

serves as an approximation to the logarithm of the integrated likelihood of the model and is free from the choice of prior.
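To make the relation between (4.16) and the BIC concrete, the sketch below compares three quantities for Bernoulli data under a uniform prior: the exact log integrated likelihood (a Beta function), its Laplace approximation from (4.16) with $q = \pi = 1$, and the BIC. The model, the data ($T = 80$, $n = 200$), and the function names are assumptions for illustration only.

```python
import math

def log_marginal_exact(t, n):
    """log of the integral of theta^t (1 - theta)^(n - t) over (0, 1),
    i.e. log B(t + 1, n - t + 1)."""
    return math.lgamma(t + 1) + math.lgamma(n - t + 1) - math.lgamma(n + 2)

def max_log_lik(t, n):
    mle = t / n
    return t * math.log(mle) + (n - t) * math.log(1.0 - mle)

def log_marginal_laplace(t, n):
    """(4.16) with p = 1, q = 1 and h = log-likelihood / n."""
    mle = t / n
    c = 1.0 / (mle * (1.0 - mle))  # -h''(mle)
    return max_log_lik(t, n) + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(n * c)

def bic(t, n, p=1):
    """log f(x | mle) - (p / 2) log n."""
    return max_log_lik(t, n) - 0.5 * p * math.log(n)

t, n = 80, 200
exact = log_marginal_exact(t, n)
laplace = log_marginal_laplace(t, n)
bic_value = bic(t, n)
print(exact, laplace, bic_value)
```

The Laplace value tracks the exact one closely, while the BIC differs from both by a term that stays bounded in $n$, which is exactly the part the criterion discards.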
Connection Between Laplace Approximation and Posterior Normality


Posterior normality discussed in Section 4.1.2 and Laplace approximation are 
closely connected. The proof of posterior normality is essentially an applica- 
tion of Laplace approximation with a rigorous handling of the error term. We 
illustrate this below by re-deriving posterior normality by an application of 
Laplace approximation. 

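Let $X_1, X_2, \ldots, X_n$ be i.i.d. observations with a density $f(x \mid \theta)$ and $\hat\theta$ be the MLE of $\theta$. We will find an approximation to the posterior distribution of $t = \sqrt{n}(\theta - \hat\theta)$ using Laplace’s method. Let $\pi(\theta)$ be the prior density and $\Pi(\cdot \mid \mathbf{x})$ denote the posterior distribution. Then for $a > 0$,

\[ \Pi(-a < t < a \mid \mathbf{x}) = \Pi(\hat\theta - a/\sqrt{n} < \theta < \hat\theta + a/\sqrt{n} \mid \mathbf{x}) = J_n / I_n \]

where

\[ J_n = \int_{\hat\theta - a/\sqrt{n}}^{\hat\theta + a/\sqrt{n}} e^{n h(\theta)}\, \pi(\theta)\, d\theta, \qquad I_n = \int_{-\infty}^{\infty} e^{n h(\theta)}\, \pi(\theta)\, d\theta, \]

and $h(\theta) = \frac{1}{n} L_n(\theta) = \frac{1}{n} \sum_{i=1}^n \log f(X_i \mid \theta)$.

As obtained above,

\[ I_n \sim e^{n h(\hat\theta)}\, \pi(\hat\theta)\, \frac{\sqrt{2\pi}}{\sqrt{nc}}, \]

with $c = -h''(\hat\theta)$, which is the observed Fisher information per unit observation. Using Laplace’s method for $J_n$, we have

\[ J_n \sim e^{n h(\hat\theta)} \int_{\hat\theta - a/\sqrt{n}}^{\hat\theta + a/\sqrt{n}} \left[ \pi(\hat\theta) + (\theta - \hat\theta)\pi'(\hat\theta) + \text{smaller terms} \right] \exp\left[ -nc(\theta - \hat\theta)^2 / 2 \right] d\theta \]

\[ \sim e^{n h(\hat\theta)}\, \pi(\hat\theta) \int_{\hat\theta - a/\sqrt{n}}^{\hat\theta + a/\sqrt{n}} \exp\left[ -nc(\theta - \hat\theta)^2 / 2 \right] d\theta = e^{n h(\hat\theta)}\, \pi(\hat\theta)\, \frac{1}{\sqrt{n}} \int_{-a}^{a} e^{-c t^2 / 2}\, dt. \]

Thus, for $a > 0$,

\[ \Pi(-a < t < a \mid \mathbf{x}) \sim \frac{\sqrt{c}}{\sqrt{2\pi}} \int_{-a}^{a} e^{-c t^2 / 2}\, dt = P(-a < Z < a) \quad \text{where } Z \sim N(0, c^{-1}). \]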


4.3.2 Tierney-Kadane-Kass Refinements 


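Suppose

\[ E^\pi(g(\theta) \mid \mathbf{x}) = \frac{\int g(\theta)\, f(\mathbf{x} \mid \theta)\, \pi(\theta)\, d\theta}{\int f(\mathbf{x} \mid \theta)\, \pi(\theta)\, d\theta} \tag{4.17} \]

is the Bayesian quantity of interest where $g$, $f$, and $\pi$ are smooth functions of $\theta$. If we express (4.17) as

\[ E^\pi(g(\theta) \mid \mathbf{x}) = \frac{\int g(\theta)\, e^{n h(\theta)}\, d\theta}{\int e^{n h(\theta)}\, d\theta} \]

with $h(\theta) = \frac{1}{n} \log\{ f(\mathbf{x} \mid \theta)\pi(\theta) \}$ and apply the Laplace approximation (4.16) to both the numerator and denominator (with $q$ equal to $g$ and 1), we obtain a first-order approximation

\[ E^\pi(g(\theta) \mid \mathbf{x}) = g(\hat\theta) \{ 1 + O(n^{-1}) \} \]

(here $\hat\theta$ denotes the posterior mode). This has been derived by Tierney and Kadane (1986), Kass et al. (1988), and Tierney et al. (1989).

Suppose now that $g$ in (4.17) is positive, and let $n h(\theta) = \log f(\mathbf{x} \mid \theta) + \log \pi(\theta)$, $n h^*(\theta) = n h(\theta) + \log g(\theta) = n h(\theta) + G(\theta)$, say. Now apply (4.16) to both the numerator and denominator of (4.17) with $q$ equal to 1. Then, letting $\hat\theta^*$ denote the mode of $h^*$, $\Sigma = \Delta_h^{-1}(\hat\theta)$, $\Sigma^* = \Delta_{h^*}^{-1}(\hat\theta^*)$, Tierney and Kadane (1986) obtain the surprisingly accurate approximation

\[ E^\pi(g(\theta) \mid \mathbf{x}) = \frac{|\Sigma^*|^{1/2} \exp\left( n h^*(\hat\theta^*) \right)}{|\Sigma|^{1/2} \exp\left( n h(\hat\theta) \right)} \{ 1 + O(n^{-2}) \}. \tag{4.18} \]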


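It is shown below how the approximation (4.18) is obtained in Tierney and Kadane (1986) for the case with a real parameter.

Let $\sigma^2 = -1/h''(\hat\theta)$, $\sigma^{*2} = -1/h^{*\prime\prime}(\hat\theta^*)$. Also let $h_k = h_k(\hat\theta)$ and $h_k^* = h_k^*(\hat\theta^*)$ where $\psi_k(\theta) = (d^k/d\theta^k)\psi(\theta)$ for any function $\psi(\theta)$. Note that under the usual regularity conditions $\sigma, \sigma^*, h_k, h_k^*$ are all of order $O(1)$.

Consider first the denominator of (4.17), which can be written as

\[ \int e^{n h(\theta)}\, d\theta = \int \exp\left[ n h(\hat\theta) - \frac{n}{2\sigma^2} (\theta - \hat\theta)^2 + R_n(\theta) \right] d\theta = e^{n h(\hat\theta)} \sqrt{2\pi}\, \sigma\, n^{-1/2} \int \exp(R_n(\theta))\, \phi(\theta; \hat\theta, \sigma^2/n)\, d\theta \]

where $\phi(\theta; \hat\theta, \sigma^2/n)$ is the $N(\hat\theta, \sigma^2/n)$ density for $\theta$ and

\[ R_n(\theta) = n h(\theta) - n h(\hat\theta) + \frac{n}{2\sigma^2} (\theta - \hat\theta)^2. \]

Using the expansion of $e^x$ at zero and the expressions for moments of a normal distribution, we can obtain an approximation of order $O(n^{-r})$ for any $r \ge 1$. Retaining terms up to the 6th derivative $h_6$ in the expansion of $R_n$, Tierney and Kadane (1986) obtain

\[ \int e^{n h(\theta)}\, d\theta = e^{n h(\hat\theta)} \sqrt{2\pi}\, \sigma\, n^{-1/2} \left( 1 + \frac{a}{n} + \frac{b}{n^2} + O(n^{-3}) \right) \tag{4.19} \]

where

\[ a = \frac{1}{8} \sigma^4 h_4 + \frac{5}{24} \sigma^6 h_3^2, \]

\[ b = \frac{1}{48} \sigma^6 h_6 + \frac{35}{384} \sigma^8 h_4^2 + \frac{7}{48} \sigma^8 h_3 h_5 + \frac{35}{64} \sigma^{10} h_3^2 h_4 + \frac{385}{1152} \sigma^{12} h_3^4. \]

We have an exactly similar approximation for the numerator of (4.17) with $\sigma$ and $h_k$ replaced by $\sigma^*$ and $h_k^*$. We then have

\[ E^\pi(g(\theta) \mid \mathbf{x}) = \frac{\sigma^*}{\sigma} \exp\{ n(h^*(\hat\theta^*) - h(\hat\theta)) \}\; \frac{1 + a^*/n + b^*/n^2 + O(n^{-3})}{1 + a/n + b/n^2 + O(n^{-3})} \]

\[ = \frac{\sigma^*}{\sigma} \exp\{ n(h^*(\hat\theta^*) - h(\hat\theta)) \} \left( 1 + \frac{a^* - a}{n} + \frac{b^* - b - a(a^* - a)}{n^2} + O(n^{-3}) \right). \]

Now note that

\[ 0 = h^{*\prime}(\hat\theta^*) = h'(\hat\theta^*) + (1/n) G'(\hat\theta^*) \]

\[ \approx h'(\hat\theta) + (\hat\theta^* - \hat\theta) h''(\hat\theta) + (1/n) G'(\hat\theta) + (1/n)(\hat\theta^* - \hat\theta) G''(\hat\theta) \]

\[ = (\hat\theta^* - \hat\theta)\left( h''(\hat\theta) + (1/n) G''(\hat\theta) \right) + (1/n) G'(\hat\theta) \]

which implies $\hat\theta^* - \hat\theta = O(n^{-1})$. This, together with the fact that $h_k^*(\theta) = h_k(\theta) + (1/n) G_k(\theta)$, implies $a^* - a$ and $b^* - b$ are both of order $O(n^{-1})$. It then follows that

\[ E^\pi(g(\theta) \mid \mathbf{x}) = \frac{\sigma^*}{\sigma} \exp\{ n(h^*(\hat\theta^*) - h(\hat\theta)) \} \left( 1 + O(n^{-2}) \right). \]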
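The ratio form $(\sigma^*/\sigma)\exp\{n(h^*(\hat\theta^*) - h(\hat\theta))\}$ is easy to try out on a case with a known answer. The sketch below assumes Bernoulli data with a uniform prior and $g(\theta) = \theta$, so that the exact posterior mean is $(T+1)/(n+2)$ and both modes are available in closed form ($\hat\theta = T/n$, $\hat\theta^* = (T+1)/(n+1)$). The function name and the data ($T = 6$, $n = 20$) are illustrative assumptions.

```python
import math

def tierney_kadane_mean(t, n):
    """Tierney-Kadane approximation to E(theta | data) for t successes in
    n Bernoulli trials with a uniform prior; requires 0 < t < n."""
    def n_h(theta):
        # n h(theta) = log-likelihood + log prior (the uniform prior adds 0)
        return t * math.log(theta) + (n - t) * math.log(1.0 - theta)

    def n_h_star(theta):
        # n h*(theta) = n h(theta) + log g(theta), with g(theta) = theta
        return n_h(theta) + math.log(theta)

    def second_deriv(theta, successes):
        # second derivative of successes*log(theta) + (n - t)*log(1 - theta)
        return -successes / theta**2 - (n - t) / (1.0 - theta) ** 2

    mode = t / n                    # maximizer of h
    mode_star = (t + 1) / (n + 1)   # maximizer of h*
    sigma = math.sqrt(-n / second_deriv(mode, t))                # sigma^2 = -1/h''
    sigma_star = math.sqrt(-n / second_deriv(mode_star, t + 1))  # sigma*^2 = -1/h*''
    return (sigma_star / sigma) * math.exp(n_h_star(mode_star) - n_h(mode))

t, n = 6, 20
tk_mean = tierney_kadane_mean(t, n)
exact_mean = (t + 1) / (n + 2)
first_order = t / n  # g evaluated at the posterior mode
print(tk_mean, exact_mean, first_order)
```

Even at this small sample size the ratio form lands much closer to the exact mean than the first-order value $g(\hat\theta)$, reflecting the cancellation of the $O(n^{-1})$ terms in numerator and denominator.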


Example 4.6. We consider the data in Table 2.1 presented in Example 2.3. This is a set of data on food poisoning and we focus on the main suspect, namely, potato salad. Separately for Crabmeat and No Crabmeat, we wish to test the null hypothesis that there is no association between potato salad and illness.

Let $p_1$ be the probability of being ill given that potato salad is taken and $p_2$ be the same given no potato salad. If $X_1$ denotes the number of people falling ill out of a total of $n_1$ people taking potato salad and $X_2$ denotes the same out of a total of $n_2$ people taking no potato salad, then $X_1$ and $X_2$ may be modeled as independent binomial variables with $X_i$ following $B(n_i, p_i)$, $i = 1, 2$. The test for no association between potato salad and illness is then equivalent to testing $H_0 : p_1 = p_2$.

We first carry out the test through credible intervals for $p_1 - p_2$ as described in Section 2.7.4. In order to obtain an exact Bayes test we have to choose prior densities for $p_1$ and $p_2$. We have seen in Example 2.2 that the choice of a Beta prior for a binomial proportion simplifies the calculation of posterior. If we consider a Beta$(\alpha_i, \beta_i)$ prior for $p_i$, $i = 1, 2$, the posterior density of $\theta = p_1 - p_2$ can be obtained as

\[ \pi(\theta \mid X_1, X_2) \propto \int_{\max(0, -\theta)}^{\min(1, 1 - \theta)} (\theta + p_2)^{X_1 + \alpha_1 - 1} (1 - \theta - p_2)^{n_1 - X_1 + \beta_1 - 1}\, p_2^{X_2 + \alpha_2 - 1} (1 - p_2)^{n_2 - X_2 + \beta_2 - 1}\, dp_2 \]

which can only be numerically calculated for a given $\theta$. Because the sample size here is sufficiently large, we will, however, find an approximation to the posterior distribution using asymptotic posterior normality. This does not involve specification of the prior distributions. One can easily calculate the Fisher information matrix $\hat I_n$ and show that the approximate distribution of $\theta = p_1 - p_2$ is $N(a, b^2)$ where

\[ a = \hat p_1 - \hat p_2, \quad b^2 = \hat p_1(1 - \hat p_1)/n_1 + \hat p_2(1 - \hat p_2)/n_2, \quad \hat p_1 = X_1/n_1, \quad \hat p_2 = X_2/n_2. \]

A $100(1 - \alpha)\%$ HPD credible interval for $\theta$ is then

\[ a - b z_{\alpha/2} < \theta < a + b z_{\alpha/2} \]

where $z_{\alpha/2}$ is the $100(1 - \alpha/2)\%$ quantile of $N(0, 1)$.

For the case with crabmeat, $X_1 = 120$, $n_1 = 200$, $X_2 = 4$, $n_2 = 35$. The 99% HPD credible interval turns out to be (0.337, 0.635). For the case with no crabmeat, $X_1 = 22$, $n_1 = 46$, $X_2 = 0$, $n_2 = 23$ and the 99% HPD credible interval is (0.307, 0.650). In both the cases, the hypothesized value (0) of $\theta$ falls well outside the credible intervals, implying strong evidence against the null hypothesis of no association.
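The normal-approximation interval $a \pm b\,z$ for $p_1 - p_2$ can be computed in a few lines. The sketch below uses the crabmeat counts quoted in the example; the function name and the choice to pass the normal quantile $z$ explicitly are assumptions of this sketch, made so that different quantile conventions can be compared.

```python
import math
from statistics import NormalDist

def normal_credible_interval(x1, n1, x2, n2, z):
    """Interval a -/+ z*b from the approximate N(a, b^2) posterior of p1 - p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    a = p1_hat - p2_hat
    b = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return a - z * b, a + z * b

z_0995 = NormalDist().inv_cdf(0.995)  # two-sided 99% quantile, about 2.576
z_0990 = NormalDist().inv_cdf(0.990)  # 0.99 quantile, about 2.326

wide = normal_credible_interval(120, 200, 4, 35, z_0995)
narrow = normal_credible_interval(120, 200, 4, 35, z_0990)
print(wide, narrow)
```

With the 0.99 quantile this arithmetic reproduces the quoted (0.337, 0.635) to three decimals, while the 0.995 quantile gives a slightly wider interval; which convention underlies the quoted figures is an inference from this calculation, not something the text states. Either way, zero lies far outside the interval.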

We can calculate the significance level $P$ by finding the $100(1 - P)\%$ credible interval that has the value 0 of the null hypothesis on its boundary. More directly this will be the usual P-value corresponding with the observed $\chi^2$ with one d.f. We consider only the case with crabmeat. The other case can be handled similarly. The logarithm of the ratio of the maximized likelihoods under $H_0$ and $H_1$ is obtained as $\log \Lambda = -15.4891$. Therefore P-value $= P(\chi_1^2 > 30.9782) \approx 0$.

We now look at the same problem through the Bayes factor (BF). In order to compute the BF, we may use the Beta prior as mentioned above. However, because there is no consensus prior for this problem, we use the Schwarz BIC (Section 4.3.1) to approximate the BF. For the case with crabmeat, the BF arising from BIC is given by $BF_{01} = 2.8754 \times 10^{-6}$. This implies that with equal prior probabilities for $H_0$ and $H_1$, the posterior probability of $H_0$ is $[1 + 1/BF_{01}]^{-1} = 2.8754 \times 10^{-6}$. This is very small but not as small as the P-value. In any case, both the approaches indicate strong evidence in favor of potato salad being the cause of food poisoning.
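The likelihood ratio, P-value, and BIC-based Bayes factor for the crabmeat data can be reproduced with the sketch below. It assumes that the sample size entering the BIC penalty is $n_1 + n_2 = 235$ and that $H_0$ and $H_1$ have one and two free parameters respectively, so that $\log BF_{01} \approx \log\Lambda + \tfrac{1}{2}\log n$; the helper name `binom_loglik` is illustrative. The tail probability uses the identity $P(\chi_1^2 > x) = \operatorname{erfc}(\sqrt{x/2})$.

```python
import math

def binom_loglik(x, n, p):
    """Binomial log-likelihood without the combinatorial constant;
    zero counts contribute nothing."""
    value = 0.0
    if x > 0:
        value += x * math.log(p)
    if n - x > 0:
        value += (n - x) * math.log(1.0 - p)
    return value

x1, n1, x2, n2 = 120, 200, 4, 35
pooled = (x1 + x2) / (n1 + n2)

loglik_h0 = binom_loglik(x1, n1, pooled) + binom_loglik(x2, n2, pooled)
loglik_h1 = binom_loglik(x1, n1, x1 / n1) + binom_loglik(x2, n2, x2 / n2)
log_lambda = loglik_h0 - loglik_h1

# P(chi^2_1 > -2 log Lambda) = erfc(sqrt(-log Lambda))
p_value = math.erfc(math.sqrt(-log_lambda))

# BIC difference between H0 (one parameter) and H1 (two parameters),
# with n taken as n1 + n2 (an assumption of this sketch).
log_bf01 = log_lambda + 0.5 * math.log(n1 + n2)
bf01 = math.exp(log_bf01)
posterior_h0 = 1.0 / (1.0 + 1.0 / bf01)
print(log_lambda, p_value, bf01, posterior_h0)
```

Under these assumptions the computed values agree with the figures quoted in the example, and the posterior probability of $H_0$ is essentially equal to $BF_{01}$ because the Bayes factor is so small.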


4.4 Exercises 


1. With Poisson likelihood and Gamma prior for the Poisson parameter $\theta$, show that the posterior is consistent at any $\theta_0 > 0$.

2. Let $X_1, \ldots, X_n$ be i.i.d. observations with a common density $f(x \mid \theta)$, $\theta \in \Theta = \{\theta_1, \theta_2, \ldots, \theta_k\}$. Consider a prior $(\pi_1, \pi_2, \ldots, \pi_k)$, with $\pi_i > 0$ for all $i$, $\sum \pi_i = 1$. Suppose the distributions corresponding to $f(x \mid \theta_i)$, $i = 1, \ldots, k$, are all distinct. Show that the posterior is consistent at each $\theta_i$.
(Hint: Express the posterior in terms of $Z_r = (1/n) \sum_{j=1}^n \log( f(X_j \mid \theta_r) / f(X_j \mid \theta_i) )$, $r = 1, \ldots, k$.)

3. Show that asymptotic posterior normality (as stated in Theorem 4.2) implies posterior consistency at $\theta_0$.

4. Verify Condition (A4) (see Theorem 4.2) for the $N(\theta, 1)$ example.

5. Obtain Laplace approximation to the integrated likelihood from (4.5).

6. Consider $N(\mu, 1)$ likelihood. Generate data of size 30 from $N(0, 1)$. Consider the following priors for $\mu$: (i) $N(0, 2)$ (ii) $N(1, 2)$ (iii) $U(-3, 3)$. For each of these priors find $P(-0.5 < \mu < 0.5)$ and $P(-0.2 < \mu < 0.6)$ using (a) exact calculation (b) normal approximation.
Do the same thing with data generated from $N(1, 1)$.

7. Let $X_1, \ldots, X_n$ be i.i.d. $N(\theta, 1)$ and the prior distribution of $\theta$ be $N(0, \tau^2)$. Consider the problem of testing $H_0 : \theta \le 0$ versus $H_1 : \theta > 0$.
(a) Show that the Bayes risk $r(\pi)$ given by (4.11) reduces to

\[ r(\pi) = 2 \int_0^\infty \Phi(-\sqrt{n}\,\theta)\, \pi(\theta)\, d\theta \]

where $\pi(\cdot)$ denotes the $N(0, \tau^2)$ density for $\theta$.
(b) Verify (4.13) in this case.

8. Find numerically the exact posterior density of $\theta = p_1 - p_2$ in Example 4.6 with independent uniform priors for $p_1$ and $p_2$. Compare this with the normal approximation to the posterior.

9. Using the idea of Laplace method for approximating integrals, find the following approximation for $n!$ (see Example 4.5)

\[ n! \approx n^{n + \frac{1}{2}}\, e^{-n} \sqrt{2\pi}. \]


5 


Choice of Priors for Low-dimensional 
Parameters 


Given data, a Bayesian will need a likelihood function $p(x|\theta)$ and a prior
$p(\theta)$. For many standard problems, the likelihood is known either from past
experience or by convention. To drive the Bayesian engine, one would still 
need an appropriate prior. In this chapter, we consider only low-dimensional 
parameters. Admittedly, low dimension is not easy to define, but we expect 
the dimension d to be much smaller than the sample size n to qualify as low. 
In most of the examples in this chapter, d = 1 or 2 and is rarely bigger than 
5. 

Ideally, one wants to choose a prior or a class of priors reflecting one’s 
prior knowledge and belief about the unknown parameters or about different 
hypotheses. This is a subjective choice. If one has a class of priors, it would 
be necessary to study robustness of various aspects of the resulting Bayesian 
inference. Choice of subjective priors, usually called elicitation of priors, is still 
rather difficult. For some systematic methods of elicitation, see Kadane et al. 
(1980), Garthwaite and Dickey (1988, 1992). A recent review is Garthwaite 
et al. (2005). 

Empirical studies have shown experience and maturity help a person in 
quantifying uncertainty about an event in the form of a probability. However, 
assigning a fully specified probability distribution to an unknown parameter 
is difficult even when the parameter has a physical meaning like length or 
breadth of some article. In such cases, it may be realistic to expect elicita- 
tion of prior mean and variance or some other prior quantities but not a full 
specification of the distribution. Hopefully, the situation will improve with 
practice, but it is hard to believe that a fully specified prior distribution will 
be available in all but very simple situations. 

It is much more common to choose and use what are called objective priors. 
When very little prior information is available, objective priors are also called 
noninformative priors. The older terminology of noninformative priors is no 
longer in favor among objective Bayesians because a complete lack of infor- 
mation is hard to define. However, it is indeed possible to construct objective 
priors with low information in the sense of Bernardo’s information measure or
non-Euclidean geometry. These priors are not unique but, as indicated for the
Bernoulli example (Example 2.2) in Chapter 2, for even a small sample size 
the posteriors arising from them are very close to each other. All these priors 
are constructed through well-defined algorithms. If some prior information is 
available, in some cases one can modify these algorithms. 

The objective priors are typically improper but have proper posteriors. 
They are suitable for estimation problems and also for testing problems where 
both null and alternative hypotheses have the same dimension. The objective 
priors need to be suitably modified for sharp null hypotheses — the subject 
of Chapter 6. 

Most of this chapter (Sections 5.1, 5.2, and 5.5) is about different principles 
and methods of construction of objective priors (Section 5.1) and common 
criticisms and answers (Section 5.2). Subjective priors appear very naturally 
when the decision maker judges his data to be exchangeable. We deal with 
this in Section 5.3. An example of elicitation of a different kind is given in 
Section 5.4. 


5.1 Different Methods of Construction of Objective 
Priors 


Because this section is rather long, we provide an overview here. 
How can we construct objective priors under general regularity conditions? 
We may do one of the following things. 


1. Define a uniform distribution that takes into account the geometry of the 
parameter space. 

2. Minimize a suitable measure of information in the prior. 

3. Choose a prior with some form of frequentist ideas because a prior with 
little information should lead to inference that is similar to frequentist 
inference. 


To fully define these methods, we have to specify the geometry in (1), the 
measure of information in (2) and the frequentist ideas that are to be used 
in (3). This will be done in Subsections 5.1.2, 5.1.3, and 5.1.4. In Subsection 
5.1.1, we discuss why the usual uniform prior $\pi(\theta) = c$ has come in for a lot of
criticism. Indeed, these criticisms help one understand the motivation behind
(1) and (2). It is a striking fact that both (1) and (2) lead to the Jeffreys
prior, namely,

$$\pi(\theta) = [\det(I_{ij}(\theta))]^{1/2}$$

where $(I_{ij}(\theta))$ is the Fisher information matrix. In the one-dimensional case,
(3) also leads to the Jeffreys prior.

We have noted in Chapter 1 that many common statistical models possess 
additional structure. Some are exponential families of distributions, some are 
location-scale families, or more generally families invariant under a group of
transformations. Normals belong to both classes. For each of these special
classes, there is a different choice of objective priors discussed in Subsections
5.1.5 and 5.1.6. The objective priors for exponential families come from the
class of conjugate priors. In the case of location-scale families with scale parameter
$\sigma$, the common objective prior is the so-called right invariant Haar
measure

$$\pi_1(\mu, \sigma) = \frac{1}{\sigma}$$

and the Jeffreys prior turns out to be the left invariant Haar measure

$$\pi_2(\mu, \sigma) = \frac{1}{\sigma^2}$$

(see Subsection 5.1.7 for definitions). Jeffreys had noted this and expressed his
preference for the former. As we discuss later, there are several strong reasons
for preferring $\pi_1$ to $\pi_2$.

To avoid some of the problems with the Jeffreys prior, Bernardo (1979) 
and Berger and Bernardo (1989) had suggested an important modification of 
the Jeffreys prior that we take up in Subsection 5.1.10. These priors are called 
reference priors. In the location-scale case, the reference prior is the right 
invariant Haar measure. They are considerably more difficult to find than the 
Jeffreys prior but explicit formulas are now available for many examples, vide 
Berger et al. (2006). A comprehensive overview and catalogue of objective 
priors, up to date as of 1995, is available in Kass and Wasserman (1996). A 
brief introduction is Ghosh and Mukerjee (1992). 


5.1.1 Uniform Distribution and Its Criticisms 


The first objective prior ever to be used is the uniform distribution over a
bounded interval. A common argument, based on “ignorance”, seems to have
been that if we know nothing about $\theta$, why should we attach more density to
one point than another? The argument given by Bayes, who was the first to
use the uniform as an objective prior, is a variation on this. It is indicated in
Problem 1. A second argument is that the uniform maximizes the Shannon
entropy. The uniform was also used a lot by Laplace, who seems to have arrived
at a Bayesian point of view independently of Bayes, but his reasoning seems
to have rested on the subjective judgment that in his problems the uniform
was appropriate.

The principle of ignorance has been criticized by Keynes, Fisher, and many
others. Essentially, the criticism is based on an invariance argument. Let $\eta =
\psi(\theta)$ be a one-to-one function of $\theta$. If we know nothing about $\theta$, then we know
nothing about $\eta$ also. So the principle of ignorance applied to $\eta$ will imply our
prior for $\eta$ is uniform (on $\psi(\Theta)$) just as it had led to a uniform prior for $\theta$.
But this leads to a contradiction. To see this suppose $\psi$ is differentiable and
$p(\eta) = c$ on $\psi(\Theta)$. Then the prior $p^*(\theta)$ for $\theta$ is

$$p^*(\theta) = p(\eta)\,|\psi'(\theta)| = c\,|\psi'(\theta)|$$

which is not a constant in general.

This argument also leads to an invariance principle. Suppose we have an
algorithm that produces noninformative priors for both $\theta$ and $\eta$, then these
priors $p^*(\theta)$ and $p(\eta)$ should be connected by the equation

$$p^*(\theta) = p(\eta)\,|\psi'(\theta)| \qquad (5.1)$$

i.e., a noninformative prior should be invariant under one-to-one differentiable
transformations.
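
The invariance requirement can be illustrated numerically. The sketch below is our own illustration, not an example from the text: it takes a Bernoulli model with success probability $\theta$ and the log-odds $\eta = \psi(\theta) = \log(\theta/(1-\theta))$, uses the standard Fisher informations $1/(\theta(1-\theta))$ for $\theta$ and $\theta(1-\theta)$ for $\eta$, and checks that the square-root-information prior satisfies (5.1) while a uniform prior on $\eta$ does not pull back to a constant. All function names are hypothetical.

```python
import math


def jeffreys_theta(theta: float) -> float:
    """Unnormalized sqrt(Fisher information) for a Bernoulli success probability."""
    return math.sqrt(1.0 / (theta * (1.0 - theta)))


def jeffreys_eta(eta: float) -> float:
    """Unnormalized sqrt(Fisher information) for the log-odds eta."""
    theta = 1.0 / (1.0 + math.exp(-eta))
    return math.sqrt(theta * (1.0 - theta))


def psi(theta: float) -> float:
    """The one-to-one map eta = psi(theta) (log-odds)."""
    return math.log(theta / (1.0 - theta))


def psi_prime(theta: float) -> float:
    """Derivative of psi."""
    return 1.0 / (theta * (1.0 - theta))


for theta in (0.1, 0.3, 0.5, 0.9):
    lhs = jeffreys_theta(theta)                             # p*(theta)
    rhs = jeffreys_eta(psi(theta)) * abs(psi_prime(theta))  # p(eta)|psi'(theta)|
    uniform_pullback = 1.0 * abs(psi_prime(theta))          # uniform on eta, seen in theta
    print(theta, lhs, rhs, uniform_pullback)
```

The first two printed columns agree at every $\theta$, whereas the last column varies with $\theta$, which is exactly the contradiction described for the principle of ignorance.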

The second argument in favor of the uniform, based on Shannon entropy, is
also flawed. Shannon (1948) derives a measure of entropy in the finite discrete
case from certain natural axioms. His entropy is

$$H(p) = -\sum_{i=1}^{n} p_i \log p_i$$

which is maximized by the discrete uniform, i.e., at $p = (\frac{1}{n}, \ldots, \frac{1}{n})$. Entropy
is a measure of the amount of uncertainty about the outcome of the
experiment. A prior that maximizes this will maximize uncertainty, so it is a
noninformative prior. Because such a prior should minimize information, we
take negative of entropy as information. This usage differs from Shannon’s
identification of information and entropy.

Shannon’s entropy is a natural measure in the discrete case and the discrete
uniform appears to be the right noninformative prior. The continuous case is
an entirely different matter. Shannon himself pointed out that for a density $p$

$$H(p) = -\int (\log p(x))\, p(x)\, dx$$

is unsatisfactory: clearly it is not derived from axioms, it is not invariant under
one-one transformations, and, as pointed out by Bernardo, it depends on the
measure $\mu(x)\,dx$ with respect to which the density $p(x)$ is taken. Note also that
the measure is not non-negative. Just take $\mu(x) = 1$ and take $p(x)$ = uniform
on $[0, c]$. Then $H(p) > 0$ if and only if $c > 1$, which seems quite arbitrary.
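
The computation behind the last claim is short. With the uniform density $p(x) = 1/c$ on $[0, c]$ and the reference measure taken to be Lebesgue measure,

$$H(p) = -\int_0^c \left( \log \frac{1}{c} \right) \frac{1}{c}\, dx = \log c,$$

so the sign of $H(p)$ depends only on whether $c$ exceeds 1, that is, on the unit in which $x$ happens to be measured.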

Finally, if the density is taken with respect to $\mu(x)\,dx$, then it is easy to
verify that the density is $p/\mu$ and

$$H(p) = -\int \left( \log \frac{p(x)}{\mu(x)} \right) \frac{p(x)}{\mu(x)}\, \mu(x)\, dx$$

is maximized at $p = \mu$, i.e., the entropy is maximum at the arbitrary $\mu$.
For all these reasons, we do not think $H(p)$ is the right entropy to maximize.
A different entropy, also due to Shannon, is explored in Subsection 5.1.3.
However, $H(p)$ serves a useful purpose when we have partial information.
For details, see Subsection 5.1.12.


5.1.2 Jeffreys Prior as a Uniform Distribution


This section is based on Section 8.2.1 of Ghosh and Ramamoorthi (2003). We 
show in this section that if we construct a uniform distribution taking into ac- 
count the topology, it automatically satisfies the invariance requirement (5.1). 
Moreover, this uniform distribution is the Jeffreys prior. Problem 2 shows one 
can construct many other priors that satisfy the invariance requirement. Of 
course, they are not the uniform distribution in the sense of this section. Being 
an invariant uniform distribution is more important than just being invariant. 
Suppose $\Theta = \mathbb{R}^d$ and $I(\theta) = (I_{ij}(\theta))$ is the $d \times d$ Fisher information matrix.
We assume $I(\theta)$ is positive definite for all $\theta$. Rao (1987) had proposed the
Riemannian metric $\rho$ related to $I(\theta)$ by

$$\rho(\theta, \theta + d\theta) = \sum_{i,j} I_{ij}(\theta)\, d\theta_i\, d\theta_j\, (1 + o(1)).$$

It is known, vide Cencov (1982), that this is the unique Riemannian metric
that transforms suitably under one-one differentiable transformations on $\Theta$.
Notice that in general $\Theta$ does not inherit the usual Euclidean metric that goes
with the (improper) uniform distribution over $\mathbb{R}^d$.

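Fix a $\theta_0$ and let $\psi(\theta)$ be a smooth one-to-one transformation such that
the information matrix

$$I^{\psi} = \left( E_{\psi}\left( \frac{\partial \log p}{\partial \psi_i}\, \frac{\partial \log p}{\partial \psi_j} \right) \right)$$

is the identity matrix $I$ at $\psi_0 = \psi(\theta_0)$. This implies the local geometry
in the $\psi$-space around $\psi_0$ is Euclidean and hence $d\psi$ is a suitable uniform
distribution there. If we now lift this back to the $\theta$-space by using the Jacobian
of transformation and the simple fact

$$\left( \frac{\partial \theta_j}{\partial \psi_i} \right) (I_{ij}(\theta)) \left( \frac{\partial \theta_j}{\partial \psi_i} \right)^{T} = I^{\psi} = I,$$

we get the Jeffreys prior in the $\theta$-space,

$$d\psi = \left\{ \det\left( \frac{\partial \theta_i}{\partial \psi_j} \right) \right\}^{-1} d\theta = \{\det[I_{ij}(\theta)]\}^{1/2}\, d\theta.$$
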

A similar method is given in Hartigan (1983, pp. 48, 49). Ghosal et al. (1997)
present an alternative construction where one takes a compact subset of the
parameter space and approximates this by a finite set of points in the so-called
Hellinger metric

$$d(P_\theta, P_{\theta'}) = \left[ \int \left( \sqrt{p_\theta} - \sqrt{p_{\theta'}} \right)^2 dx \right]^{1/2}$$

where $p_\theta$ and $p_{\theta'}$ are the densities of $P_\theta$ and $P_{\theta'}$. One then puts a discrete
uniform distribution on the approximating finite set of points and lets the degree
of approximation tend to zero. Then the corresponding discrete uniforms
converge weakly to the Jeffreys distribution. The Jeffreys prior was introduced
in Jeffreys (1946).


5.1.3 Jeffreys Prior as a Minimizer of Information 


As in Subsection 5.1.1, let the Shannon entropy associated with a random
variable or vector $Z$ be denoted by

$$H(Z) = H(p) = -E_p(\log p(Z))$$

where $p$ is the density (probability function) of $Z$. Let $X = (X_1, X_2, \ldots, X_n)$
have density or probability function $p(x|\theta)$ where $\theta$ has prior density $p(\theta)$.
We assume $X_1, X_2, \ldots, X_n$ are i.i.d. and conditions for asymptotic normality
of posterior $p(\theta|x)$ hold. We have argued earlier that $H(p)$ is not a good
measure of entropy and $-H(p)$ not a good measure of information if $p$ is a
density. Using an idea of Lindley (1956) in the context of design of experiments,
Bernardo (1979) suggested that a Kullback-Leibler divergence between
prior and posterior, namely,

$$J(p(\theta), X) = E\left( \log \frac{p(\theta|X)}{p(\theta)} \right) = \int \left\{ \int \left[ \int \log \frac{p(\theta'|x)}{p(\theta')}\, p(\theta'|x)\, d\theta' \right] p(x|\theta)\, dx \right\} p(\theta)\, d\theta \qquad (5.2)$$

is a better measure of entropy and $-J$ a better measure of information in the
prior. To get a feeling for this, notice that if the prior is nearly degenerate, at
say some $\theta_0$, so will be the posterior. This would imply $J$ is nearly zero. On the
other hand, if $p(\theta)$ is rather diffuse, $p(\theta|x)$ will differ a lot from $p(\theta)$, at least
for moderate or large $n$, because $p(\theta|x)$ would be quite peaked. In fact, $p(\theta|x)$
would be approximately normal with mean $\hat\theta$ and variance of the order $O(\frac{1}{n})$.
The substantial difference between prior and posterior would be reflected by
a large value of $J$. To sum up, $J$ is small when $p$ is nearly degenerate and large
when $p$ is diffuse, i.e., $J$ captures how diffuse the prior is. It therefore makes
sense to maximize $J$ with respect to the prior.
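
The behavior of $J$ can be seen in a small numerical experiment. The sketch below is an illustration under our own assumptions, not a computation from the text: it takes $n$ Bernoulli trials with a Beta$(a, b)$ prior, uses the fact that $E\log\{p(\theta|X)/p(\theta)\} = E\log\{p(X|\theta)/m(X)\}$ with $m$ the marginal of the data, and evaluates the integral over $\theta$ by a midpoint rule. The function names and the two priors compared are hypothetical choices.

```python
import math


def beta_pdf(theta, a, b):
    """Beta(a, b) density, computed on the log scale for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))


def bernardo_J(a, b, n, grid_size=4000):
    """Expected KL divergence between posterior and prior for n Bernoulli
    trials under a Beta(a, b) prior (midpoint-rule integration over theta)."""
    h = 1.0 / grid_size
    thetas = [(i + 0.5) * h for i in range(grid_size)]
    prior = [beta_pdf(t, a, b) for t in thetas]
    total = 0.0
    for x in range(n + 1):  # x = number of successes, sufficient for theta
        log_binom = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
        lik = [math.exp(log_binom + x * math.log(t) + (n - x) * math.log(1 - t))
               for t in thetas]
        marginal = sum(l * p for l, p in zip(lik, prior)) * h
        # log{p(theta|x)/p(theta)} equals log{p(x|theta)/m(x)}
        total += sum(p * l * math.log(l / marginal)
                     for l, p in zip(lik, prior) if l > 0) * h
    return total


J_diffuse = bernardo_J(1, 1, n=20)      # uniform prior on (0, 1)
J_peaked = bernardo_J(200, 200, n=20)   # prior concentrated near 0.5
print(J_diffuse, J_peaked)
```

The diffuse prior gives a much larger value of $J$ than the concentrated one, in line with the heuristic that $J$ measures how diffuse the prior is relative to the experiment.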

Bernardo suggested one should not work with the sample size n of the given 
data and maximize the J for this n. For one thing, this would be technically 
forbidding in most cases and, more importantly, the functional J is expected 
to be a nice function of the prior only asymptotically. We show below how 
asymptotic maximization is to be done. Berger et al. (1989) have justified 
to some extent the need to maximize asymptotically. They show that if one 
maximizes for fixed n, maximization may lead to a discrete prior with finitely 
many jumps, a far cry from a diffuse prior. We also note in passing that the
measure $J$ is a functional depending on the prior but in the given context of
a particular experiment with i.i.d. observations having density $p(x|\theta)$. This is
a major difference from the Shannon entropy and suggests information in a 
prior is a relative concept, relative to a particular experiment. 

We now return to the question of asymptotic maximization. Fix an increasing
sequence of compact $d$-dimensional rectangles $K_i$ whose union is $\mathbb{R}^d$.
For a fixed $K_i$, we consider only priors $p_i$ supported on $K_i$, and let $n \to \infty$.
We assume posterior normality holds in the Kullback-Leibler sense, i.e.,

$$\lim_{n \to \infty} E\left( \log \frac{p(\theta|X)}{\hat p(\theta|X)} \right) = \lim_{n \to \infty} \int_{K_i} E_\theta \left\{ \int_{K_i} \log \frac{p(\theta'|X)}{\hat p(\theta'|X)}\, p(\theta'|X)\, d\theta' \right\} p_i(\theta)\, d\theta = 0 \qquad (5.3)$$

where $\hat p$ is the approximating $d$-dimensional normal distribution $N(\hat\theta, I^{-1}(\hat\theta)/n)$.
For sufficient conditions see Clarke and Barron (1990) and Ibragimov and
Has’minskii (1981).

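In view of (5.3), it is enough to consider

$$J(p_i) = \int_{K_i} \left\{ \int \left[ \int_{K_i} \log \frac{\hat p(\theta'|x)}{p_i(\theta')}\, p(\theta'|x)\, d\theta' \right] p(x|\theta)\, dx \right\} p_i(\theta)\, d\theta.$$

Using appropriate results on normal approximation to posterior distribution,
it can be shown that

$$J(p_i) = \int_{K_i} \left\{ \int \left[ \int_{K_i} \log \frac{\hat p(\theta'|x)}{p_i(\theta')}\, \hat p(\theta'|x)\, d\theta' \right] p(x|\theta)\, dx \right\} p_i(\theta)\, d\theta + o_p(1)$$

$$= \left\{ -\frac{d}{2} \log(2\pi) - \frac{d}{2} + \frac{d}{2} \log n \right\} + \int_{K_i} \log(\det I(\theta))^{1/2}\, p_i(\theta)\, d\theta - \int_{K_i} (\log p_i(\theta))\, p_i(\theta)\, d\theta + o_p(1). \qquad (5.4)$$

Here we have used the well-known fact about the exponent of a multivariate
normal that

$$\int -\frac{1}{2} (\theta' - \hat\theta)^{T} (n I(\hat\theta)) (\theta' - \hat\theta)\, \hat p(\theta'|x)\, d\theta' = -\frac{d}{2}.$$

Hence by (5.3) and (5.4), we may write

$$J(p_i) = \left\{ -\frac{d}{2} \log(2\pi) - \frac{d}{2} + \frac{d}{2} \log n \right\} + \int_{K_i} \log \left\{ \frac{(\det I(\theta))^{1/2}}{p_i(\theta)} \right\} p_i(\theta)\, d\theta + o_p(1).$$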


Thus apart from a constant and a negligible $o_p(1)$ term, $J$ is the functional

$$\int_{K_i} \log\left[ c_i (\det(I(\theta)))^{1/2} / p_i(\theta) \right] p_i(\theta)\, d\theta - \log c_i$$

where $c_i$ is a normalizing constant such that $c_i[\det(I(\theta))]^{1/2}$ is a probability
density on $K_i$. The functional is maximized by taking

$$p_i(\theta) = \begin{cases} c_i[\det(I(\theta))]^{1/2} & \text{on } K_i; \\ 0 & \text{elsewhere.} \end{cases} \qquad (5.5)$$
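
One way to see why (5.5) is the maximizer, sketched here under the assumption that all the integrals involved are finite: write $q_i(\theta) = c_i[\det(I(\theta))]^{1/2}$ for the normalized density on $K_i$ (the symbol $q_i$ is ours). The functional is then

$$\int_{K_i} \log\left[ \frac{q_i(\theta)}{p_i(\theta)} \right] p_i(\theta)\, d\theta - \log c_i = -KL(p_i \,\|\, q_i) - \log c_i \le -\log c_i,$$

where $KL(p_i \,\|\, q_i) = \int_{K_i} p_i \log(p_i/q_i)\, d\theta$ is the Kullback-Leibler divergence. It is non-negative and vanishes only when $p_i = q_i$ almost everywhere, so the upper bound is attained exactly at the choice (5.5).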


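Thus for every $K_i$, the (normalized) Jeffreys prior $p_i(\theta)$ maximizes Bernardo’s
entropy. In a very weak sense, the $p_i(\theta)$ of (5.5) converge to $p(\theta) = [\det(I(\theta))]^{1/2}$,
namely, for any two fixed Borel sets $B_1$ and $B_2$ contained in $K_{i_0}$ for some $i_0$,

$$\lim_{i \to \infty} \frac{\int_{B_1} p_i(\theta)\, d\theta}{\int_{B_2} p_i(\theta)\, d\theta} = \frac{\int_{B_1} p(\theta)\, d\theta}{\int_{B_2} p(\theta)\, d\theta}.$$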

This mode of convergence is very weak. Berger and Bernardo (1992,
Equation 2.2.5) suggest convergence based on a metric that compares the
posterior of the proper priors over compact sets $K_i$ and the limiting improper
prior (whether Jeffreys or reference or other). Examples show lack of conver- 
gence in this sense may lead to severe inadmissibility and other problems with 
inference based on the limiting improper prior. However, checking this kind of 
convergence is technically difficult in general and not attempted in this book. 

We end this section with a discussion of the measure of entropy or information.
In the literature, it is often associated with Shannon’s missing information.
Shannon (1948) introduced this measure in the context of a noisy
channel. Any channel has a source that produces (say, per second) messages
$X$ with p.m.f. $p_X(x)$ and entropy

$$H(X) = -\sum_x p_X(x) \log p_X(x). \qquad (5.6)$$

A channel will have an output $Y$ (per second) with entropy

$$H(Y) = -\sum_y p_Y(y) \log p_Y(y).$$

If the channel is noiseless, then $H(Y) = H(X)$.
If the channel is noisy, $Y$ given $X$ is still random. Let $p(x, y)$ denote their
joint p.m.f. The joint entropy is

$$H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y).$$

Following Shannon, let $p_x(y) = P\{Y = y | X = x\}$ and consider the conditional
entropy of $Y$ given $X$, namely,

$$H_X(Y) = -\sum_{x,y} p(x, y) \log p_x(y).$$

Clearly, $H(X, Y) = H(X) + H_X(Y)$ and similarly $H(X, Y) = H(Y) + H_Y(X)$.
$H_Y(X)$ is called the equivocation or average ambiguity about input $X$ given
only output $Y$. It is the information about input $X$ that is received given the
output $Y$. By Theorem 10 of Shannon (1948), it is the amount of additional
information that must be supplied per unit time at the receiving end to correct
the received message.

Thus $H_Y(X)$ is the missing information. So the amount of information produced
in the channel (per unit time) is

$$H(X) - H_Y(X)$$

which may be shown to be non-negative by Shannon’s basic results

$$H(X) + H(Y) \ge H(X, Y) = H(Y) + H_Y(X).$$

In statistical problems, we take $X$ to be $\theta$ and $Y$ to be the observation vector
$X$. Then $H(\theta) - H_X(\theta)$ is the same measure as before, namely

$$E\left( \log \frac{p(\theta|X)}{p(\theta)} \right).$$

The maximum of

$$H(X) - H_Y(X)$$

with respect to the source, i.e., with respect to $p(x)$ is what Shannon calls the
capacity of the channel. Over compact rectangles, the Jeffreys prior is this
maximizing distribution for the statistical channel.
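
The channel quantities above are easy to compute for a small discrete example. The sketch below is illustrative only: the binary symmetric channel with flip probability 0.1 is a hypothetical choice of ours, and the function names are not from the text. It computes $H(X)$, the equivocation $H_Y(X) = H(X, Y) - H(Y)$, and their difference for several source distributions.

```python
import math


def entropy(probs):
    """Shannon entropy -sum p log p (natural log), skipping zero cells."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def channel_information(source, transition):
    """Return H(X), the equivocation H_Y(X), and H(X) - H_Y(X).

    source[x] is p(x); transition[x][y] is p_x(y) = P(Y = y | X = x).
    """
    n_x, n_y = len(source), len(transition[0])
    joint = [[source[x] * transition[x][y] for y in range(n_y)] for x in range(n_x)]
    p_y = [sum(joint[x][y] for x in range(n_x)) for y in range(n_y)]
    h_x = entropy(source)
    h_xy = entropy([p for row in joint for p in row])
    equivocation = h_xy - entropy(p_y)  # H_Y(X) = H(X, Y) - H(Y)
    return h_x, equivocation, h_x - equivocation


# Hypothetical binary symmetric channel: each input is flipped with probability 0.1.
flip = 0.1
bsc = [[1 - flip, flip], [flip, 1 - flip]]
for p1 in (0.1, 0.3, 0.5):
    print(p1, channel_information([1 - p1, p1], bsc))
```

For this channel the difference $H(X) - H_Y(X)$ is largest at the uniform source, which is therefore the capacity-achieving source distribution here; in the statistical channel that role is played by the prior.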

It is worth pointing out that the Jeffreys prior is a special case of the 
reference priors of Bernardo (1979). 

Another point of interest is that as $n \to \infty$, most of Bernardo’s information
is contained in the constant term of the asymptotic decomposition. This would 
suggest that for moderately large n, choice of prior is not important. 

The measure of information used by Bernardo was introduced earlier in 
Bayesian design of experiments by Lindley (1956). There p(@) is fixed but the 
observable X is not fixed, and the object is to choose a design, i.e., X, to 
minimize the information. Minimization is for the given sample size n, not 
asymptotic as in Bernardo (1979). 


5.1.4 Jeffreys Prior as a Probability Matching Prior 


One would expect an objective prior with low information to provide inference
similar to that based on the uniform prior for $\theta$ in $N(\theta, 1)$.

In the case of $N(\theta, 1)$ with a uniform prior for $\theta$, the posterior distribution
of the pivotal quantity $\theta - \bar X$, given $X$, is identical with the frequentist
distribution of $\theta - \bar X$, given $\theta$. In the general case we will not get exactly the
same distribution but only up to $O_p(n^{-1})$. A precise definition of a probability
matching prior for a single parameter is given below.

Let $X_1, X_2, \ldots, X_n$ be i.i.d. $p(x|\theta)$, $\theta \in \Theta \subset \mathbb{R}$. Assume regularity conditions
needed for expansion of the posterior with the normal $N(\hat\theta, (nI(\hat\theta))^{-1})$
as the leading term. For $0 < \alpha < 1$, choose $\theta_\alpha(X)$ depending on the prior
$p(\theta)$ such that

$$P\{\theta < \theta_\alpha(X) | X\} = 1 - \alpha + O_p(n^{-1}). \qquad (5.7)$$

It can be verified that $\theta_\alpha(X) = \hat\theta + O_p(1/\sqrt{n})$. We say $p(\theta)$ is probability
matching (to first order), if

$$P_\theta\{\theta < \theta_\alpha(X)\} = 1 - \alpha + O(n^{-1}) \qquad (5.8)$$

(uniformly on compact sets of $\theta$). In the normal case with $p(\theta)$ = constant,

$$\theta_\alpha(X) = \bar X + z_\alpha/\sqrt{n}$$

where $P\{Z > z_\alpha\} = \alpha$, $Z \sim N(0, 1)$.
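
In the normal case the matching is exact, and this can be checked by simulation. The sketch below is a minimal illustration with hypothetical settings (the values of `theta`, `n`, `alpha`, the number of replications, and the seed are ours): it estimates the frequentist probability $P_\theta\{\theta < \bar X + z_\alpha/\sqrt{n}\}$ by Monte Carlo, using $\bar X \sim N(\theta, 1/n)$.

```python
import random
from statistics import NormalDist


def coverage(theta, n, alpha, reps=20000, seed=0):
    """Monte Carlo estimate of P_theta{theta < Xbar + z_alpha / sqrt(n)}."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # P(Z > z_alpha) = alpha
    hits = 0
    for _ in range(reps):
        xbar = rng.gauss(theta, 1 / n ** 0.5)  # Xbar ~ N(theta, 1/n)
        if theta < xbar + z_alpha / n ** 0.5:
            hits += 1
    return hits / reps


print(coverage(theta=2.0, n=10, alpha=0.05))
```

The estimate should be close to $1 - \alpha$ whatever the values of $\theta$ and $n$, which is the frequentist side of (5.8) for the flat prior.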

We have matched posterior probability and frequentist probability up to
$O_p(n^{-1})$. Why one chooses this particular order may be explained as follows.
For any prior $p(\theta)$ satisfying some regularity conditions, the two probabilities
mentioned above agree up to $O(n^{-1/2})$, so $O(n^{-1/2})$ is too weak. On the other
hand, if we strengthen $O(n^{-1})$ to, say, $O(n^{-3/2})$, in general no prior would be
probability matching. So $O(n^{-1})$ is just right.

It is instructive to view probability matching in a slightly different but
equivalent way. Instead of working with $\theta_\alpha(X)$ one may choose to work with
the approximate quantile $\hat\theta + z_\alpha/\sqrt{n}$ and require

$$P\{\theta < \hat\theta + z_\alpha/\sqrt{n} \,|\, X\} = P_\theta\{\theta < \hat\theta + z_\alpha/\sqrt{n}\} + O_p(n^{-1}) \qquad (5.9)$$

under $\theta$ (uniformly on compact sets of $\theta$).

Each of these two probabilities has an expansion starting with $(1 - \alpha)$ and
having terms decreasing in powers of $n^{-1/2}$. So for probability matching, we
must have the same next term in the expansion.

In principle one would have to expand the probabilities and set the two
second terms equal, leading to

$$\frac{d I^{-1/2}(\theta)}{d\theta} = -\frac{I^{-1/2}(\theta)}{p(\theta)}\, \frac{dp(\theta)}{d\theta}. \qquad (5.10)$$

The left-hand side comes from the frequentist probability, the right-hand side 
from the posterior probability (taking into account the limits of random quan- 
tities under 0). There are many common terms in both probabilities that can- 
cel and hence do not need to be calculated. A convenient way of deriving 
this is through what is called a Bayesian route to frequentist calculations or 
a shrinkage argument. For details see Ghosh (1994, Chapter 9) or Ghosh and 
Mukerjee (1992), or Datta and Mukerjee (2004).

If one tried to match probabilities up to $O(n^{-3/2})$, one would have to match
the next terms in the expansion also. This would lead to two differential
equations in the prior and in general they will not have a common solution.

Clearly, the unique solution to (5.10) is the Jeffreys prior
$p(\theta) \propto \sqrt{I(\theta)}$.


Equation (5.10) may not hold if $\hat\theta$ has a discrete lattice distribution. Suppose
$X$ has a discrete distribution. Then the case where $\hat\theta$ has a lattice distribution
causes the biggest problem in carrying through the previous theory. But
the Jeffreys prior may be approximately probability matching in some sense;
see Ghosh (1994), Rousseau (2000), Brown et al. (2001, 2002).
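
In the regular (non-lattice) case, the uniqueness of the Jeffreys solution can be checked in one line. The sketch assumes the one-parameter matching condition can be written in the product form below, which is the $d = 1$ case of the multiparameter equation (5.13) given later in this subsection:

$$\frac{d}{d\theta}\left\{ p(\theta)\, I^{-1/2}(\theta) \right\} = 0 \iff p(\theta)\, I^{-1/2}(\theta) = c \iff p(\theta) = c\, \sqrt{I(\theta)},$$

so the solution is unique up to the multiplicative constant $c$, and it is the Jeffreys prior.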

If d > 1, in general there is no multivariate probability matching prior 
(even for the continuous case), vide Ghosh and Mukerjee (1993), Datta (1996). 
It is proved in Datta (1996) that the Jeffreys prior continues to play an im- 
portant role. 

We consider the special case d = 2 by way of illustration. For more details, 
see Datta and Mukerjee (2004). 

Let $\theta = (\theta_1, \theta_2)$ and suppose we want to match posterior probability of $\theta_1$
and a corresponding frequentist probability through the following equations.

$$P\{\theta_1 < \theta_{1,\alpha}(X) | X\} = 1 - \alpha + O_p(n^{-1}), \qquad (5.11)$$

$$P\{\theta_1 < \theta_{1,\alpha}(X) | \theta_1, \theta_2\} = 1 - \alpha + O(n^{-1}). \qquad (5.12)$$

Here $\theta_{1,\alpha}(X)$ is the (approximate) $100(1 - \alpha)$-quantile of $\theta_1$. If $\theta_2$ is orthogonal
to $\theta_1$ in the sense that the off-diagonal element $I_{12}(\theta)$ of the information
matrix is zero, then the probability matching prior is

$$p(\theta) = \sqrt{I_{11}(\theta)}\, \psi(\theta_2)$$

where $\psi(\theta_2)$ is an arbitrary function of $\theta_2$.

For a general multiparameter model with a one-dimensional parameter
of interest $\theta_1$ and nuisance parameters $\theta_2, \ldots, \theta_d$, the probability matching
equation is given by

$$\sum_{j=1}^{d} \frac{\partial}{\partial \theta_j} \left\{ p(\theta)\, I^{j1}(\theta)\, (I^{11}(\theta))^{-1/2} \right\} = 0 \qquad (5.13)$$

where $I^{-1}(\theta) = (I^{jk}(\theta))$. This is obtained by equating the coefficient of $n^{-1/2}$
in the expansion of the left-hand side of (5.12) to zero; details are given, for
example, in Datta and Mukerjee (2004).


Example 5.1. Consider the location-scale model

$$p(x|\mu, \sigma) = \frac{1}{\sigma} f\left( \frac{x - \mu}{\sigma} \right),$$

$-\infty < \mu < \infty$, $\sigma > 0$, where $f(\cdot)$ is a probability density. Let $\theta_1 = \mu$ and
$\theta_2 = \sigma$, i.e., $\mu$ is the parameter of interest. It is easy to verify that $I^{j1} \propto \sigma^2$
for $j = 1, 2$ and hence in view of (5.13) the prior

$$p(\mu, \sigma) \propto \frac{1}{\sigma}$$

is probability matching. Similarly, one can also verify that the same prior is
probability matching when $\sigma$ is the parameter of interest.
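
The verification for $\theta_1 = \mu$ can be spelled out as follows, reading the matching equation (5.13) as $\sum_j \frac{\partial}{\partial \theta_j}\{p(\theta)\, I^{j1} (I^{11})^{-1/2}\} = 0$. Write $I^{j1} = a_j \sigma^2$, where the constants $a_j$ (our notation) do not depend on $(\mu, \sigma)$. Then each term inside the braces is

$$p(\mu, \sigma)\, I^{j1}\, (I^{11})^{-1/2} = p(\mu, \sigma)\, a_j\, a_1^{-1/2}\, \sigma,$$

which is a constant when $p(\mu, \sigma) \propto 1/\sigma$. Every partial derivative in the sum therefore vanishes, and the equation holds.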


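Example 5.2. We now consider a bivariate normal model with means $\mu_1, \mu_2$,
variances $\sigma_1^2, \sigma_2^2$, and correlation coefficient $\rho$, all the parameters being unknown.
Suppose the parameter of interest is the regression coefficient $\rho\sigma_2/\sigma_1$.
We reparameterize as

$$\theta_1 = \rho\sigma_2/\sigma_1, \quad \theta_2 = \sigma_2^2(1 - \rho^2), \quad \theta_3 = \sigma_1^2, \quad \theta_4 = \mu_1, \quad \theta_5 = \mu_2$$

which is an orthogonal parameterization in the sense that $I_{1j}(\theta) = 0$ for
$2 \le j \le 5$. Then $I^{j1}(\theta) = 0$ for $2 \le j \le 5$, $I^{11}(\theta) = I_{11}^{-1}(\theta) = \theta_2/\theta_3$, and the
probability matching equation (5.13) reduces to

$$\frac{\partial}{\partial \theta_1} \left\{ p(\theta) (\theta_2/\theta_3)^{1/2} \right\} = 0, \quad \text{i.e.,} \quad \frac{\partial}{\partial \theta_1}\, p(\theta) = 0.$$

Hence the probability matching prior is given by

$$p(\theta) = \psi(\theta_2, \ldots, \theta_5) (\theta_3/\theta_2)^{1/2}$$

where $\psi(\theta_2, \ldots, \theta_5)$ is an arbitrary smooth function of $(\theta_2, \ldots, \theta_5)$.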
One can also verify that a prior of the form

$$p^*(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = \{\sigma_1^r \sigma_2^s (1 - \rho^2)^t\}^{-1}$$

with reference to the original parameterization is probability matching if and
only if $t = \frac{1}{2}s + 1$ (vide Datta and Mukerjee, 2004, pp. 28, 29).


5.1.5 Conjugate Priors and Mixtures 
Let $X_1, \ldots, X_n$ be i.i.d. with a one-parameter exponential density

$$p(x|\theta) = \exp\{A(\theta) + \theta\psi(x) + h(x)\}.$$

We recall from Chapter 1 that $T = \sum_1^n \psi(X_i)$ is a minimal sufficient statistic
and $E_\theta(\psi(X_1)) = -A'(\theta)$. The likelihood is

$$\exp\{nA(\theta) + \theta T\}\, \exp\Big\{ \sum_1^n h(X_i) \Big\}.$$


To construct a conjugate prior, i.e., a prior leading to posteriors of the same
form, start with a so-called noninformative, possibly improper density $\mu(\theta)$.
We will choose $\mu$ to be the uniform or the Jeffreys prior density. Then define
a prior density

$$p(\theta) = c\, e^{mA(\theta) + \theta s} \mu(\theta) \qquad (5.14)$$

where $c$ is the normalizing constant,

$$c = \left[ \int_\Theta e^{mA(\theta) + \theta s} \mu(\theta)\, d\theta \right]^{-1}$$

if the integral is finite, and arbitrary if $p(\theta)$ is an improper prior. The constants
$m$ and $s$ are hyperparameters of the prior. They have to be chosen so that the
posterior is proper, i.e.,

$$\int_\Theta e^{(m+n)A(\theta) + \theta(s+T)} \mu(\theta)\, d\theta < \infty.$$

In this case, the posterior is

$$p(\theta|x) = c'\, e^{(m+n)A(\theta) + \theta(s+T)} \mu(\theta), \qquad (5.15)$$

i.e., the posterior is of the same form as the prior. Only the hyperparameters
are different.

In other words, the family of priors $p(\theta)$ (vide (5.14)) is closed with respect
to the formation of posterior. The form of the posterior allows us to interpret
the hyperparameters in the prior. Assume initially the prior was $\mu$. Take $m$
to be a positive integer and think of a hypothetical sample of size $m$, with
hypothetical data $x'_1, \ldots, x'_m$, such that $\sum_1^m \psi(x'_i) = s$. The prior is the same as
a posterior with $\mu$ as prior and $s$ as hypothetical data based on a sample of size
$m$. This suggests $m$ is a precision parameter. We expect that the larger the $m$, the
stronger is our faith in such quantities as the prior mean. The hyperparameter
$s/m$ has a simple interpretation as a prior guess about $E_\theta(\psi) = -A'(\theta)$, which
is usually an important parametric function.

To prove the statement about $s/m$, we need to assume $\mu(\theta)$ = constant,
i.e., $\mu$ is the uniform distribution. We also assume all the integrals appearing
below are finite.

Let $\Theta = (a, b)$, where $a$ may be $-\infty$, $b$ may be $\infty$. Integrating by parts

$$E(-A'(\theta)) = c \int_a^b (-A'(\theta))\, e^{mA(\theta) + \theta s}\, d\theta = c \left[ -\frac{e^{mA(\theta)}}{m}\, e^{\theta s} \right]_a^b + c\, \frac{s}{m} \int_a^b e^{mA(\theta)} e^{\theta s}\, d\theta = \frac{s}{m} \qquad (5.16)$$

if $e^{mA(\theta) + \theta s} = 0$ at $\theta = a, b$, which is often true if $\Theta$ is the natural parameter
space. Diaconis and Ylvisaker (1979) have shown that (5.16) characterizes the
prior.

A similar calculation with the posterior, vide Problem 3, shows the posterior
mean of $-A'(\theta)$ is

$$E(-A'(\theta)|X) = \frac{m}{m+n}\, \frac{s}{m} + \frac{n}{m+n}\, (-A'(\hat\theta)), \qquad (5.17)$$

i.e., the posterior mean is a weighted mean of the prior guess $s/m$ and the MLE
$-A'(\hat\theta) = T/n$ and the weights are proportional to the precision parameter $m$
and the sample size $n$.
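
The weighted-mean formula can be checked numerically in a simple case. The sketch below is an illustration under our own assumptions: it uses the Bernoulli model in its natural parameterization, where $A(\theta) = -\log(1 + e^\theta)$, $\psi(x) = x$, and $-A'(\theta) = e^\theta/(1 + e^\theta)$ is the success probability, takes $\mu$ = uniform, and integrates the unnormalized posterior by a midpoint rule. The hyperparameters and data are hypothetical.

```python
import math


def posterior_mean_of_p(m, s, n, T, grid_size=20000, lo=-30.0, hi=30.0):
    """Posterior mean of -A'(theta) for Bernoulli data under the conjugate prior
    exp{m A(theta) + theta s} with mu = uniform, by midpoint-rule integration
    over the natural parameter theta."""
    h = (hi - lo) / grid_size
    num = den = 0.0
    for i in range(grid_size):
        theta = lo + (i + 0.5) * h
        A = -math.log1p(math.exp(theta))             # A(theta) = -log(1 + e^theta)
        w = math.exp((m + n) * A + theta * (s + T))  # unnormalized posterior density
        num += w / (1.0 + math.exp(-theta))          # -A'(theta), the success probability
        den += w
    return num / den


m, s = 4.0, 1.0   # hypothetical prior: like 4 earlier trials with 1 success
n, T = 10, 7      # hypothetical data: 7 successes in 10 trials
numeric = posterior_mean_of_p(m, s, n, T)
weighted = (m / (m + n)) * (s / m) + (n / (m + n)) * (T / n)
print(numeric, weighted)
```

The two printed numbers agree, and both equal $(s + T)/(m + n)$, which makes the role of $m$ as a prior sample size explicit.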

If $\mu$ is the Jeffreys distribution, the right-hand side of (5.16), i.e., $s/m$ may
be interpreted as

$$E\left( -A'(\theta) / \sqrt{-A''(\theta)} \right) \Big/ E\left( 1 / \sqrt{-A''(\theta)} \right),$$

i.e., $s/m$ is a ratio of two prior guesses, a less compelling interpretation
than for $\mu$ = uniform.

Somewhat trivially $\mu$ itself, whether uniform or Jeffreys, is a conjugate
prior corresponding to $m = 0$, $s = 0$. Also, in special cases like the binomial
and normal, the Jeffreys prior is a conjugate prior with $\mu$ = uniform. We
do not know of any general relation connecting the Jeffreys prior and the
conjugate priors with $\mu$ = uniform.

Conjugate priors, specially with $\mu$ = uniform, were very popular because
the posterior is easy to calculate, the hyperparameters are easy to interpret
and hence elicit, and the Bayes estimate for $E_\theta(\psi(X))$ has a nice interpretation.
All these facts generalize to the case of multiparameter exponential family of
distributions

$$p(x|\theta) = \exp\Big\{ A(\theta) + \sum_{i=1}^{d} \theta_i \psi_i(x) + h(x) \Big\}.$$

The conjugate prior now takes the form

$$p(\theta) = \exp\Big\{ mA(\theta) + \sum_{i=1}^{d} \theta_i s_i \Big\}\, \mu(\theta)$$

where $\mu$ = uniform or Jeffreys, $m$ is the precision parameter and $s_i/m$ may be
interpreted as the prior guess for $E_\theta(\psi_i(X))$ if $\mu$ = uniform. Once again the
hyperparameters are easy to elicit. Also the Bayes estimate for $E_\theta(\psi_i(X))$ is
a weighted mean of the prior guess and the MLE.

It has been known for some time that all these attractive properties can
also be a problem. First of all, note that having a single precision parameter
even for the multiparameter case limits the flexibility of conjugate priors;
one cannot represent complex prior belief. The representation of the Bayes
estimate as a weighted mean can be an embarrassment if there is serious
conflict between prior guess and MLE. For example, what should one do if
the prior guess is 10 and MLE is 100 or vice versa? In such cases, one should
usually give greater weight to data unless the prior belief is based on reliable
expert opinion, in which case greater weight would be given to prior guess.
In any case, a simple weighted mean seems ridiculous. A related fact is that
a conjugate prior usually has a sharp tail, whereas prior knowledge about the
tail of a prior is rarely strong.

A cure for these problems is to take a mixture of conjugate priors by 
putting a prior on the hyperparameters. The class of mixtures is quite rich 
and given any prior, one can in principle construct a mixture of conjugate 
priors that approximates it. A general result of this sort is proved in Dalal 
and Hall (1980). A simple heuristic argument is given below. 

Given any prior one can approximate it by a discrete probability
distribution $(p_1, \ldots, p_k)$ over a finite set of points, say
$(\eta_1, \ldots, \eta_k)$ where $\eta_j = E_{\theta_j}(\psi_1(X), \ldots, \psi_d(X))$.
This may be considered as a mixture over $k$ degenerate distributions of which
the $j$th puts all the probability on $\eta_j$. By choosing the precision $m$
sufficiently large, so that each conjugate prior is tightly concentrated, and
taking the prior guess equal to $\eta_j$, one can approximate the $k$ degenerate
distributions by $k$ conjugate priors. Finally, mix them by assigning weight
$p_j$ to the $j$th conjugate prior.

Of course the simplest applications would be to multimodal priors. The 
posterior for a mixture of conjugate priors can often be calculated numerically 
by MCMC (Markov chain Monte Carlo) method. See Chapter 7 for examples. 
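For Bernoulli data and a finite mixture of Beta priors no MCMC is needed: the
posterior is again a mixture of Beta distributions, with each weight multiplied
by the marginal likelihood of its component. The following minimal sketch
assumes this Beta-Bernoulli setting; the function names and the bimodal prior
are hypothetical illustrations.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def mixture_posterior(components, x, n):
    """Posterior for a mixture of Beta priors after x successes in n trials.

    components is a list of (weight, a, b). Each component updates to
    Beta(a + x, b + n - x); its weight is multiplied by the marginal
    likelihood ratio B(a + x, b + n - x) / B(a, b) and then renormalized.
    """
    log_w = [math.log(w) + log_beta(a + x, b + n - x) - log_beta(a, b)
             for (w, a, b) in components]
    top = max(log_w)
    unnorm = [math.exp(lw - top) for lw in log_w]
    total = sum(unnorm)
    return [(u / total, a + x, b + n - x)
            for u, (_, a, b) in zip(unnorm, components)]


# Bimodal prior: equal weight on prior guesses 0.2 and 0.8, each with precision 10.
prior = [(0.5, 2.0, 8.0), (0.5, 8.0, 2.0)]
post = mixture_posterior(prior, x=9, n=10)
```

With 9 successes in 10 trials almost all posterior weight moves to the
component whose prior guess is 0.8, so the data resolve the conflict between
the two guesses instead of averaging them.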

As an example of a mixture we consider the Cauchy prior used in Jeffreys
test for normal mean $\mu$ with unknown variance $\sigma^2$, described in
Section 2.7.2. The conjugate prior for $\mu$ given $\sigma^2$ is normal and the
Cauchy prior used here is a scale mixture of normals $N(0, \tau^{-1})$ where
$\tau$ is a mixing Gamma variable. This mixture has a heavier tail than the
normal and use of such a prior means the inference is influenced more by the
data than the prior. It is expected that, in general, mixtures of conjugate
priors will have this property, but we have not seen any investigation in the
literature.
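The scale-mixture representation can be checked by simulation. The sketch below
assumes the standard Cauchy case, where the mixing variable is Gamma with shape
1/2 and scale 2 (a chi-square variable with one degree of freedom); the sample
size and seed are arbitrary choices.

```python
import random
import statistics


def sample_scale_mixture(n, seed=0):
    """Draw n values from N(0, 1/tau) with tau ~ Gamma(shape 1/2, scale 2)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # Guard against a (practically impossible) zero draw before inverting.
        tau = max(rng.gammavariate(0.5, 2.0), 1e-300)
        draws.append(rng.gauss(0.0, tau ** -0.5))
    return draws


draws = sample_scale_mixture(100_000)
# For a standard Cauchy variable the median of |X| is 1.
abs_median = statistics.median(abs(d) for d in draws)
# Heavy tail: the Cauchy probability of |X| > 10 is about 0.063.
tail_fraction = sum(abs(d) > 10.0 for d in draws) / len(draws)
```

A standard normal would put essentially no mass beyond 10, which is the sense
in which the mixture has the heavier tail.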


5.1.6 Invariant Objective Priors for Location-Scale Families 


Let $\theta = (\mu, \sigma)$, $-\infty < \mu < \infty$, $\sigma > 0$ and

$$p(x|\theta) = \frac{1}{\sigma} f\Big(\frac{x - \mu}{\sigma}\Big) \qquad (5.18)$$


where $f(z)$ is a probability density on $R$. Let $I(\theta)$ be the $2 \times 2$
Fisher information matrix. Then easy calculation shows

$$I(\theta) = \frac{1}{\sigma^2} I(0, 1),$$

with $I(0, 1)$ the information matrix at $(\mu, \sigma) = (0, 1)$, which implies
the Jeffreys prior is proportional to $1/\sigma^2$. We show in Section 5.1.7 that
this prior corresponds with the left invariant Haar measure and


$$p_2(\mu, \sigma) = \frac{1}{\sigma}$$

corresponds to the right invariant Haar measure. See Dawid (Encyclopedia of
Statistics) for other relevant definitions of invariance and their implications.
We discuss some desirable properties of $p_2$ in Subsection 5.1.8.


5.1.7 Left and Right Invariant Priors 


We now derive objective priors for location-scale families making use of
invariance. Consider the linear transformations

$$g_{a,b}\,x = a + bx, \quad -\infty < a < \infty,\ b > 0.$$

Then

$$g_{c,d}.g_{a,b}.x = c + d(a + bx) = c + ad + dbx.$$

We may express this symbolically as $g_{c,d}.g_{a,b} = g_{e,f}$ where
$e = c + ad$, $f = db$ specify the multiplication rule. Let
$G = \{g_{a,b};\ -\infty < a < \infty,\ b > 0\}$. Then $G$ is a group.

It is convenient to represent $g_{a,b}$ by the vector $(a, b)$ and rewrite the
multiplication rule as

$$(c, d).(a, b) = (e, f). \qquad (5.19)$$


Then we may identify $R \times R^+$ with $G$ and use both notations freely. We
give $R \times R^+$ its usual topological or geometric structure. The general
theory of (locally compact) groups (see, e.g., Halmos (1950, 1974) or Nachbin
(1965)) shows there are two measures $\mu_1$ and $\mu_2$ on $G$ such that $\mu_1$
is left invariant, i.e., for all $g \in G$ and $A \subset G$,

$$\mu_1(gA) = \mu_1(A)$$

and $\mu_2$ is right invariant, i.e., for all $g$ and $A$,

$$\mu_2(Ag) = \mu_2(A)$$

where $gA = \{gg';\ g' \in A\}$, $Ag = \{g'g;\ g' \in A\}$.

The measures $\mu_1$ and $\mu_2$ are said to be the left invariant and right
invariant Haar measures. They are unique up to multiplicative constants. We now
proceed to determine them explicitly.

Suppose we assume $\mu_2$ has a density $f_2$, i.e., denoting points in
$R \times R^+$ by $(a_1, a_2)$

$$\mu_2(A) = \int_A f_2(a_1, a_2)\, da_1\, da_2 \qquad (5.20)$$

and assume $f_2$ is a continuous function. With $g = (b_1, b_2)$ and
$(c_1, c_2) = (a_1, a_2).(b_1, b_2)$ with

$$c_1 = a_1 + a_2 b_1, \quad c_2 = a_2 b_2, \qquad (5.21)$$

one may evaluate $\mu_2(Ag)$ in several ways, e.g., using right invariance,
$\mu_2(Ag) = \mu_2(A)$, and changing variables from $(a_1, a_2)$ to $(c_1, c_2)$
in (5.20),

$$\mu_2(Ag) = \int_{Ag} h(c_1, c_2)\, dc_1\, dc_2 \qquad (5.22)$$

where

$$h(c_1, c_2) = f_2(a_1, a_2)(J)^{-1} \qquad (5.23)$$

with

$$J = \frac{\partial(c_1, c_2)}{\partial(a_1, a_2)} = \begin{vmatrix} 1 & b_1 \\ 0 & b_2 \end{vmatrix} = b_2.$$

Also, by definition of $f_2$

$$\mu_2(Ag) = \int_{Ag} f_2(c_1, c_2)\, dc_1\, dc_2. \qquad (5.24)$$

Because (5.22) and (5.24) hold for all $A$ and $f_2$ is continuous, we must have

$$f_2(c_1, c_2) = h(c_1, c_2), \qquad (5.25)$$

i.e.,

$$f_2(c_1, c_2) = f_2(a_1, a_2)\,\frac{1}{b_2}$$

for all $(a_1, a_2), (b_1, b_2) \in R \times R^+$. Set $a_1 = 0$, $a_2 = 1$. Then
$f_2(b_1, b_2) = f_2(0, 1)\,\frac{1}{b_2}$, i.e.,

$$f_2(b_1, b_2) = \text{constant}\ \frac{1}{b_2}. \qquad (5.26)$$

It is easy to verify that $\mu_2$ defined by (5.20) is right invariant if $f_2$
is as in (5.26). One has merely to verify (5.25) and then (5.22).

Proceeding in the same way, one can show that the left invariant Haar
measure has density

$$f_1(b_1, b_2) = \frac{1}{b_2^2}.$$
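The invariance claims for the two densities can be checked numerically. The
sketch below is only an illustration under our own choices: the rectangle A,
the group element g and the midpoint-rule quadrature are arbitrary, and the
measure of a translated set is computed by the change of variables used in the
derivation of (5.26).

```python
def compose(first, second):
    """Group product (c, d).(a, b) = (c + d*a, d*b), as in (5.19)."""
    return (first[0] + first[1] * second[0], first[1] * second[1])


def measure_of_image(density, image, jacobian, rect, n=200):
    """Midpoint-rule value of the measure of image(A) for a rectangle A.

    Uses the change of variables c = image(a):
    measure = integral over A of density(image(a)) * jacobian da.
    """
    (lo1, hi1), (lo2, hi2) = rect
    h1, h2 = (hi1 - lo1) / n, (hi2 - lo2) / n
    total = 0.0
    for i in range(n):
        a1 = lo1 + (i + 0.5) * h1
        for j in range(n):
            a2 = lo2 + (j + 0.5) * h2
            total += density(image((a1, a2))) * jacobian
    return total * h1 * h2


def f_right(c):
    """Right Haar density 1/b2 (constant taken as 1)."""
    return 1.0 / c[1]


def f_left(c):
    """Left Haar density 1/b2^2 (constant taken as 1)."""
    return 1.0 / c[1] ** 2


A = ((-1.0, 2.0), (0.5, 4.0))   # rectangle in R x R+
g = (1.5, 3.0)                  # fixed group element (b1, b2)

mu2_A = measure_of_image(f_right, lambda a: a, 1.0, A)
mu2_Ag = measure_of_image(f_right, lambda a: compose(a, g), g[1], A)
mu2_gA = measure_of_image(f_right, lambda a: compose(g, a), g[1] ** 2, A)
mu1_A = measure_of_image(f_left, lambda a: a, 1.0, A)
mu1_gA = measure_of_image(f_left, lambda a: compose(g, a), g[1] ** 2, A)
```

The right Haar measure is unchanged by right translation but is multiplied by
b2 under left translation, while the left Haar measure is unchanged by left
translation.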


We have now to lift these measures to the $(\mu, \sigma)$-space. To do this, we
first define an isomorphic group of transformations on the parameter space. Each
transformation $g_{a,b}\,x = a + bx$ on the sample space induces a transformation
$\bar g_{a,b}$ defined by

$$\bar g_{a,b}(\mu, \sigma) = (a + b\mu,\ b\sigma),$$

i.e., the right-hand side gives the location and scale of the distribution of
$g_{a,b}X$ where $X$ has density (5.18). The transformation
$g_{a,b} \to \bar g_{a,b}$ is an isomorphism, i.e.,

$$\overline{g_{a,b}^{-1}} = (\bar g_{a,b})^{-1}$$

and if

$$g_{a,b}.g_{c,d} = g_{e,f}$$

then

$$\bar g_{a,b}.\bar g_{c,d} = \bar g_{e,f}.$$

In view of this, we may write $\bar g_{a_1,a_2}$ also as $(a_1, a_2)$ and define
the group multiplication by (5.19) or (5.21). Consequently, left and right
invariant measures for $\bar g$ are the same as before and

$$d\mu_1(b_1, b_2) = \text{constant}\ \frac{1}{b_2^2}\, db_1\, db_2,$$

$$d\mu_2(b_1, b_2) = \text{constant}\ \frac{1}{b_2}\, db_1\, db_2.$$


We now lift these measures on to the $(\mu, \sigma)$-space by setting up a
canonical transformation

$$(\mu, \sigma) = \bar g_{\mu,\sigma}(0, 1)$$

that converts a single fixed point in the parameter space into an arbitrary
point $(\mu, \sigma)$. Because $(0, 1)$ is fixed, we can think of the above
relation as setting up a one-to-one transformation between $(\mu, \sigma)$ and
$\bar g_{\mu,\sigma} = (\mu, \sigma)$. Because this is essentially an identity
transformation from $(\mu, \sigma)$ into a group of transformations, given any
$\mu^*$ on the space of $\bar g$'s we define $\nu$ on $\Theta = R \times R^+$ as

$$\nu(A) = \mu^*\{(\mu, \sigma);\ \bar g_{\mu,\sigma} \in A\} = \mu^*(A).$$

Thus

$$d\nu_1(\mu, \sigma) = d\mu_1(\mu, \sigma) = \frac{1}{\sigma^2}\, d\mu\, d\sigma$$

and

$$d\nu_2(\mu, \sigma) = d\mu_2(\mu, \sigma) = \frac{1}{\sigma}\, d\mu\, d\sigma$$

are the left and right invariant priors for $(\mu, \sigma)$.


5.1.8 Properties of the Right Invariant Prior for Location-Scale 
Families 


The right invariant prior density

$$p_r(\mu, \sigma) = \frac{1}{\sigma}$$

has many attractive properties. We list some of them below. These properties
do not hold in general for the left invariant prior

$$p_l(\mu, \sigma) = \frac{1}{\sigma^2}.$$

Heath and Sudderth (1978, 1989) show that inference based on the posterior
corresponding with $p_r$ is coherent in a sense defined by them. Similar
properties have been shown by Eaton and Sudderth (1998, 2004). Dawid et al.
(1973) show that the posterior corresponding to $p_r$ is free from the
marginalization paradox. More generally, the posterior for a right invariant
prior is free from the marginalization paradox if the group is amenable.
Dawid (Encyclopedia of Statistics) also provides counter-examples in case the
group is not amenable. Amenability is a technical condition, also called the
Hunt-Stein condition, that is needed to prove theorems relating invariance
to minimaxity in classical statistics, vide Bondar and Milnes (1981). Datta
and Ghosh (1996) show $p_r$ is probability matching in a certain strong sense.

A famous classical theorem due to Hunt and Stein, vide Lehmann (1986),
or Kiefer (1957), implies that under certain invariance assumptions (that
include amenability of the underlying group), the Bayes solution is minimax
as well as best among equivariant rules, see also Berger (1985a). We consider
a couple of applications. Suppose we have two location-scale families
$\sigma^{-1} f_i((x - \mu)/\sigma)$, $i = 0, 1$. For example, $f_0$ may be
standard normal, i.e., $N(0, 1)$ and $f_1$ may be standard Cauchy, i.e.,

$$f_1(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}.$$

The observations $X_1, X_2, \ldots, X_n$ are i.i.d. with density belonging to one
of these two families. One has to decide which is true.

Consider the Bayes rule which accepts $f_1$ if

$$\frac{\int\!\!\int \prod_{i=1}^{n} \Big[\frac{1}{\sigma} f_1\Big(\frac{X_i - \mu}{\sigma}\Big)\Big] \frac{1}{\sigma}\, d\mu\, d\sigma}{\int\!\!\int \prod_{i=1}^{n} \Big[\frac{1}{\sigma} f_0\Big(\frac{X_i - \mu}{\sigma}\Big)\Big] \frac{1}{\sigma}\, d\mu\, d\sigma} > c.$$

If $c$ is chosen such that the Type 1 and Type 2 error probabilities are the
same, then this is a minimax test, i.e., it minimizes the maximum error
probability among all tests, where the maximum is over $i = 0, 1$ and
$(\mu, \sigma) \in R \times R^+$.
Suppose we consider the estimation problem of a location parameter with
a squared error loss. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim f_0(x - \theta)$.
Here $p_r = p_l$ = constant. The corresponding Bayes estimate is

$$\frac{\int_{-\infty}^{\infty} \theta \prod_{i=1}^{n} f_0(X_i - \theta)\, d\theta}{\int_{-\infty}^{\infty} \prod_{i=1}^{n} f_0(X_i - \theta)\, d\theta}$$

which is both minimax and best among equivariant estimators, i.e., it
minimizes $R(\theta, T(X)) = E_\theta(T(X) - \theta)^2$ among all $T$ satisfying

$$T(x_1 + a, \ldots, x_n + a) = T(x_1, \ldots, x_n) + a, \quad a \in R.$$

A similar result for scale families is explored in Problem 4.
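The Bayes estimate under the constant prior is a ratio of two one-dimensional
integrals and is easy to approximate. The sketch below is a rough illustration
only: the uniform grid centered at the sample median, its half-width, and the
data values are our assumptions, not part of the text.

```python
import math
import statistics


def pitman_estimate(x, log_f0, half_width=40.0, n_grid=8001):
    """Ratio of integrals of theta * prod f0(x_i - theta) and
    prod f0(x_i - theta), approximated on a uniform grid of theta values
    centered at the sample median."""
    center = statistics.median(x)
    step = 2.0 * half_width / (n_grid - 1)
    offsets = [-half_width + k * step for k in range(n_grid)]
    loglik = [sum(log_f0(xi - center - u) for xi in x) for u in offsets]
    top = max(loglik)  # subtract the maximum to avoid underflow in exp
    weights = [math.exp(l - top) for l in loglik]
    return center + sum(u * w for u, w in zip(offsets, weights)) / sum(weights)


def log_normal(u):
    """Log density of N(0, 1)."""
    return -0.5 * u * u - 0.5 * math.log(2.0 * math.pi)


def log_cauchy(u):
    """Log density of the standard Cauchy."""
    return -math.log(math.pi) - math.log1p(u * u)


x = [-1.2, 0.3, 0.8, 1.1, 9.5]          # hypothetical data with one outlier
est_normal = pitman_estimate(x, log_normal)
est_cauchy = pitman_estimate(x, log_cauchy)
shifted = pitman_estimate([xi + 2.5 for xi in x], log_cauchy)
```

For the normal density the estimate reduces to the sample mean, while for the
Cauchy density the outlier is largely discounted. Shifting every observation by
2.5 shifts the estimate by 2.5, which is the equivariance property.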


5.1.9 General Group Families 


There are interesting statistical problems that are left invariant by groups 
of transformations other than the location-scale transformations discussed in 
the preceding subsections. It is of interest then to consider invariant Haar 
measures for such general groups also. An example follows.


Example 5.3. Suppose $X \sim N_p(\theta, I)$. It is desired to test

$$H_0: \theta = 0 \ \text{ versus } \ H_1: \theta \ne 0.$$

This testing problem is invariant under the group $G_0$ of all orthogonal
transformations; i.e., if $H$ is an orthogonal matrix of order $p$, then
$g_H X = HX \sim N_p(H\theta, I)$, so that $\bar g_H \theta = H\theta$. Also,
$\bar g_H 0 = 0$. Further discussion of this example as well as treatment of
invariant tests is taken up in Chapter 6. Discussion on statistical
applications involving general groups can be found in sources such as Eaton
(1983, 1989), Farrell (1985), and Muirhead (1982).


5.1.10 Reference Priors 


In statistical problems that are left invariant by a group of nice transforma- 
tions, the Jeffreys prior turns out to be the left invariant prior, vide Datta and 
Ghosh (1996). But for reasons outlined in Subsection 5.1.8, one would prefer 
the right invariant prior. In all the examples that we have seen, an interesting 
modification of the Jeffreys prior, introduced in Bernardo (1979) and further 
refined in Berger and Bernardo (1989, 1992a and 1992b), leads to the right 
invariant prior. These priors are called reference priors after Bernardo (1979). 
A reference prior is simply an objective prior constructed in a particular way, 
but the term reference prior could, in principle, be applied to any objective 
prior because any objective prior is taken as some sort of objective or conven- 
tional standard, i.e., a reference point with which one may compare subjective 
priors to calibrate them. 

As pointed out by Bernardo (1979), suitably chosen reference priors can 
be appropriate in high-dimensional problems also. We discuss this in a later 
chapter. 

Our presentation in this section is based on Ghosh (1994, Chapter 9) and 
Ghosh and Ramamoorthi (2003, Chapter 1). 

If we consider all the parameters of equal importance, we maximize the
entropy of Subsection 5.1.3. This leads to the Jeffreys prior. To avoid this,
one assumes parameters as having different importance. We consider first the
case of $d = 2$ parameters, namely, $(\theta_1, \theta_2)$, where we have an
ordering of the parameters in order of importance. Thus $\theta_1$ is supposed
to be more important than $\theta_2$. For example, suppose we are considering a
random sample from $N(\mu, \sigma^2)$. If our primary interest is in $\mu$, we
would take $\theta_1 = \mu$, $\theta_2 = \sigma^2$. If our primary interest is in
$\sigma^2$, then $\theta_1 = \sigma^2$, $\theta_2 = \mu$. If our primary interest
is in $\mu/\sigma$, we take $\theta_1 = \mu/\sigma$ and $\theta_2 = \mu$ or
$\sigma$ or any other function such that $(\theta_1, \theta_2)$ is a one-to-one
sufficiently smooth function of $(\mu, \sigma)$.

For fixed $\theta_1$, the conditional density $p(\theta_2|\theta_1)$ is one
dimensional. Bernardo (1979) recommends setting this equal to the conditional
Jeffreys prior

$$c(\theta_1)\sqrt{I_{22}(\theta)}.$$

Having fixed this, the marginal $p(\theta_1)$ is chosen by maximizing

$$E\Big(\log \frac{p(\theta_1|X_1, \ldots, X_n)}{p(\theta_1)}\Big)$$

in an asymptotic sense as explained below.

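Fix an increasing sequence of closed and bounded, i.e., compact rectangles
$K_{1i} \times K_{2i}$ whose union is $\Theta$. Let $p_i(\theta_2|\theta_1)$ be the
conditional Jeffreys prior, restricted to $K_{2i}$ and $p_i(\theta_1)$ a prior
supported on $K_{1i}$. As mentioned before,

$$p_i(\theta_2|\theta_1) = c_i(\theta_1)\sqrt{I_{22}(\theta)}$$

where $c_i(\theta_1)$ is a normalizing constant such that

$$\int_{K_{2i}} p_i(\theta_2|\theta_1)\, d\theta_2 = 1.$$

Then $p_i(\theta_1, \theta_2) = p_i(\theta_1)p_i(\theta_2|\theta_1)$ on
$K_{1i} \times K_{2i}$ and we consider

$$\begin{aligned}
J(p_i(\theta_1), X) &= E\Big\{\log \frac{p_i(\theta_1|X)}{p_i(\theta_1)}\Big\} \\
&= E\Big[\log \frac{p_i(\theta|X)}{p_i(\theta)}\Big] - E\Big[\log \frac{p_i(\theta_2|\theta_1, X)}{p_i(\theta_2|\theta_1)}\Big] \\
&= J(p_i(\theta_1, \theta_2), X) - \int_{K_{1i}} p_i(\theta_1)\, J(p_i(\theta_2|\theta_1), X)\, d\theta_1
\end{aligned}$$

where for fixed $\theta_1$, $J(p_i(\theta_2|\theta_1), X)$ is the Lindley-Bernardo
functional

$$J = E\Big\{\log \frac{p_i(\theta_2|\theta_1, X)}{p_i(\theta_2|\theta_1)}\Big\}$$

with $p_i(\theta_2|\theta_1)$ being regarded as a conditional prior for
$\theta_2$ for fixed $\theta_1$.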

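Applying the asymptotic normal approximation to the first term on the
right-hand side as well as the second term on the right-hand side, as in
Subsection 5.1.3,

$$\begin{aligned}
J(p_i(\theta_1), X)
&= \Big[K_n + \int_{K_{1i} \times K_{2i}} p_i(\theta) \log\{\det(I(\theta))\}^{1/2}\, d\theta - \int_{K_{1i} \times K_{2i}} p_i(\theta) \log p_i(\theta)\, d\theta\Big] \\
&\quad - \int_{K_{1i}} p_i(\theta_1) \int_{K_{2i}} p_i(\theta_2|\theta_1) \Big[\log \sqrt{I_{22}(\theta)} - \log p_i(\theta_2|\theta_1)\Big]\, d\theta_2\, d\theta_1 \\
&= K_n + \int_{K_{1i}} p_i(\theta_1) \int_{K_{2i}} p_i(\theta_2|\theta_1) \log\Big\{\frac{\det(I(\theta))}{I_{22}(\theta)}\Big\}^{1/2} d\theta_2\, d\theta_1 - \int_{K_{1i}} p_i(\theta_1) \log p_i(\theta_1)\, d\theta_1 \\
&= K_n + \int_{K_{1i}} p_i(\theta_1) \int_{K_{2i}} p_i(\theta_2|\theta_1) \log\Big\{\frac{1}{I^{11}(\theta)}\Big\}^{1/2} d\theta_2\, d\theta_1 - \int_{K_{1i}} p_i(\theta_1) \log p_i(\theta_1)\, d\theta_1 \qquad (5.27)
\end{aligned}$$

where $K_n$ is a constant depending on $n$. Let $\psi_i(\theta_1)$ be the
geometric mean of $(I^{11}(\theta))^{-1/2}$ with respect to
$p_i(\theta_2|\theta_1)$. Then (5.27) can be written as

$$K_n + \int_{K_{1i}} p_i(\theta_1) \log \frac{\psi_i(\theta_1)}{p_i(\theta_1)}\, d\theta_1$$

which is maximized if

$$p_i(\theta_1) = \begin{cases} c_i'\,\psi_i(\theta_1) & \text{on } K_{1i} \\ 0 & \text{outside.} \end{cases}$$

The product

$$p_i(\theta) = c_i'\,\psi_i(\theta_1)\, c_i(\theta_1)\, [I_{22}(\theta)]^{1/2}$$

is the reference prior on $K_{1i} \times K_{2i}$. If we can write this as

$$p_i(\theta) = \begin{cases} d_i A(\theta_1, \theta_2) & \text{on } K_{1i} \times K_{2i} \\ 0 & \text{elsewhere} \end{cases}$$

then the reference prior on $\Theta$ may be taken as proportional to
$A(\theta_1, \theta_2)$.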

Clearly, the reference prior depends on the choice of $(\theta_1, \theta_2)$ and
the compact sets $K_{1i}$, $K_{2i}$. The normalization on $K_{1i} \times K_{2i}$
first appeared in Berger and Bernardo (1989). If an improper objective prior is
used for fixed $n$, one might run into paradoxes of the kind exhibited by Fraser
et al. (1985). See in this connection Berger and Bernardo (1992a). Recently
there has been some change in the definition of reference prior, but we
understand the final results are similar (Berger et al. (2006)).

The above procedure is based on Ghosh and Mukerjee (1992) and Ghosh
(1994). Algebraically it is more convenient to work with $[I(\theta)]^{-1}$ as in
Berger and Bernardo (1992b). We illustrate their method for $d = 3$; the general
case is handled in the same way.

First note the following two facts.

A. Suppose $\theta_1, \theta_2, \ldots, \theta_j$ follow multivariate normal with
dispersion matrix $\Sigma$. Then the conditional distribution of $\theta_j$ given
$\theta_1, \theta_2, \ldots, \theta_{j-1}$ is normal with variance equal to the
reciprocal of the $(j, j)$-th element of $\Sigma^{-1}$.

B. Following the notations of Berger and Bernardo (1992b), let
$S = [I(\theta)]^{-1}$ where $I(\theta)$ is the $d \times d$ Fisher information
matrix and $S_j$ be the $j \times j$ principal submatrix of $S$. Let
$H_j = S_j^{-1}$ and $h_j$ be the $(j, j)$-th element of $H_j$. Then by A, the
asymptotic variance of $\theta_j$ given $\theta_1, \theta_2, \ldots, \theta_{j-1}$
is $(h_j)^{-1}/n$. To get some feeling for $h_j$, note that for arbitrary $d$,
$j = d$, $h_j = I_{dd}$ and for arbitrary $d$, $j = 1$, $h_1 = 1/I^{11}$.
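The quantities in fact B are straightforward to compute for a given information
matrix. The sketch below is a minimal illustration: the small Gauss-Jordan
inverse is our own helper, and the test matrix is the per-observation
multinomial information of Example 5.5 at an arbitrarily chosen point.

```python
def inverse(mat):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    n = len(mat)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def h_values(info):
    """Return [h_1, ..., h_d], where h_j is the (j, j) element of
    H_j = S_j^{-1}, S = info^{-1}, and S_j is the leading j x j block of S."""
    s = inverse(info)
    out = []
    for j in range(1, len(info) + 1):
        s_j = [row[:j] for row in s[:j]]
        out.append(inverse(s_j)[j - 1][j - 1])
    return out


theta = [0.2, 0.3, 0.1]
rest = 1.0 - sum(theta)
# Diag{1/theta_i} plus 1/(1 - sum theta) times the all-ones matrix.
info = [[(1.0 / theta[i] if i == j else 0.0) + 1.0 / rest for j in range(3)]
        for i in range(3)]
h = h_values(info)
```

At this point the values agree with the closed form for the multinomial,
$h_j = \theta_j^{-1}(1 - \theta_1 - \cdots - \theta_{j-1})(1 - \theta_1 - \cdots - \theta_j)^{-1}$,
and the last one equals $I_{33}$, as fact B states for $j = d$.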

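We now provide a new asymptotic formula for Lindley-Bernardo information
measure, namely,

$$\begin{aligned}
&E\Big(\log \frac{p_i(\theta_j|\theta_1, \ldots, \theta_{j-1}, X)}{p_i(\theta_j|\theta_1, \ldots, \theta_{j-1})}\Big) \\
&\quad = E\big(\log N(\hat\theta_j(\theta_1, \ldots, \theta_{j-1}),\ h_j^{-1}(\theta)/n)\big) - E\big(\log p_i(\theta_j|\theta_1, \ldots, \theta_{j-1})\big) \\
&\quad = K_n + E\Big[\int_{K_{ji}} p_i(\theta_j|\theta_1, \ldots, \theta_{j-1}) \log \frac{\psi_j(\theta_1, \ldots, \theta_j)}{p_i(\theta_j|\theta_1, \ldots, \theta_{j-1})}\, d\theta_j\Big] + o_p(1) \qquad (5.28)
\end{aligned}$$

(where $\hat\theta_j(\theta_1, \ldots, \theta_{j-1})$ is the MLE for $\theta_j$
given $\theta_1, \ldots, \theta_{j-1}$ and $p_i$ is a prior supported on a compact
rectangle $K_{1i} \times K_{2i} \times \cdots \times K_{di}$), where $K_n$ is a
constant depending on $n$ and

$$\psi_j(\theta_1, \ldots, \theta_j) = \exp\Big\{\int \frac{1}{2} \log h_j(\theta)\, p(\theta_{j+1}, \ldots, \theta_d|\theta_1, \ldots, \theta_j)\, d\theta_{j+1} \cdots d\theta_d\Big\}$$

is the geometric mean of $h_j^{1/2}(\theta)$ with respect to
$p(\theta_{j+1}, \ldots, \theta_d|\theta_1, \ldots, \theta_j)$.
The proof of (5.28) parallels the proof of (5.4). It follows that
asymptotically (5.28) is maximized if we set

$$p_i(\theta_j|\theta_1, \ldots, \theta_{j-1}) = \begin{cases} c_i'(\theta_1, \ldots, \theta_{j-1})\,\psi_j(\theta_1, \ldots, \theta_j) & \text{on } K_{ji} \\ 0 & \text{elsewhere.} \end{cases} \qquad (5.29)$$

If the dimension exceeds 2, say $d = 3$, we merely start with a compact
rectangle $K_{1i} \times K_{2i} \times K_{3i}$ and set
$p_i(\theta_3|\theta_1, \theta_2) = c_i(\theta_1, \theta_2)\sqrt{I_{33}(\theta)}$
on $K_{3i}$. Then we first determine $p_i(\theta_2|\theta_1)$ and then
$p_i(\theta_1)$ by applying (5.29) twice. Thus,

$$p_i(\theta_2|\theta_1) = c_i'(\theta_1)\,\psi_2(\theta_1, \theta_2) \ \text{on } K_{2i},$$

$$p_i(\theta_1) = c_i'\,\psi_1(\theta_1) \ \text{on } K_{1i}.$$

One can also verify that the formulas obtained for $d = 2$ can be rederived in
this way. A couple of examples follow.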


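Example 5.4. Let $X_1, X_2, \ldots, X_n$ be i.i.d. normal with mean $\theta_2$
and variance $\theta_1$, with $\theta_1$ being the parameter of interest. Here

$$I(\theta) = \begin{pmatrix} \frac{1}{2\theta_1^2} & 0 \\ 0 & \frac{1}{\theta_1} \end{pmatrix}, \quad I_{22}(\theta) = \frac{1}{\theta_1}, \quad I^{11}(\theta) = 2\theta_1^2.$$

Thus $p_i(\theta_2|\theta_1) = c_i$ on $K_{2i}$ where $c_i^{-1}$ = volume of
$K_{2i}$, and therefore,

$$\psi_i(\theta_1) = \exp\Big\{\int_{K_{2i}} c_i \log \frac{1}{\sqrt{2}\,\theta_1}\, d\theta_2\Big\} = \frac{1}{\sqrt{2}\,\theta_1}.$$

We then have

$$p_i(\theta_1, \theta_2) = d_i(1/\theta_1)$$

for some constant $d_i$ and the reference prior is taken as

$$p(\theta_1, \theta_2) \propto \frac{1}{\theta_1}.$$

This is the right invariant Haar measure unlike the Jeffreys prior which is left
invariant (see Subsection 5.1.7).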
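The two-parameter construction of Example 5.4 can be traced numerically. The
sketch below computes the geometric mean of $(I^{11})^{-1/2}$ with respect to
the conditional Jeffreys prior on a compact interval by a midpoint rule; the
interval for the mean and the grid size are arbitrary choices made for
illustration.

```python
import math


def psi(theta1, info22, info_inv11, k2, n=1000):
    """Geometric mean of (I^{11})^{-1/2} over the interval k2, with respect
    to the conditional prior proportional to sqrt(I_22) on k2."""
    lo, hi = k2
    h = (hi - lo) / n
    grid = [lo + (k + 0.5) * h for k in range(n)]
    w = [math.sqrt(info22(theta1, t2)) for t2 in grid]
    log_vals = [-0.5 * math.log(info_inv11(theta1, t2)) for t2 in grid]
    return math.exp(sum(wi * lv for wi, lv in zip(w, log_vals)) / sum(w))


# Example 5.4: theta1 is the variance, theta2 is the mean.
def info22(t1, t2):
    return 1.0 / t1            # I_22


def info_inv11(t1, t2):
    return 2.0 * t1 ** 2       # I^{11}, the (1, 1) element of the inverse


k2 = (-5.0, 5.0)
values = {t1: psi(t1, info22, info_inv11, k2) for t1 in (0.5, 1.0, 4.0)}
```

The product `values[t1] * t1` is the same for every `t1`, which is the
statement that the marginal for the variance is proportional to its reciprocal.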


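Example 5.5. (Berger and Bernardo, 1992b) Let $(X_1, X_2, X_3)$ follow a
multinomial distribution with parameters $(n; \theta_1, \theta_2, \theta_3)$,
i.e., $(X_1, X_2, X_3)$ has density

$$p(x_1, x_2, x_3|\theta_1, \theta_2, \theta_3) = \frac{n!}{x_1!\, x_2!\, x_3!\, (n - x_1 - x_2 - x_3)!}\, \theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3} (1 - \theta_1 - \theta_2 - \theta_3)^{n - x_1 - x_2 - x_3},$$

$$x_i \ge 0,\ i = 1, 2, 3,\ \sum_1^3 x_i \le n, \quad \theta_i > 0,\ i = 1, 2, 3,\ \sum_1^3 \theta_i < 1.$$

Here the information matrix is

$$I(\theta) = n\, \text{Diag}\{\theta_1^{-1}, \theta_2^{-1}, \theta_3^{-1}\} + n(1 - \theta_1 - \theta_2 - \theta_3)^{-1} 1_3,$$

where Diag$\{a_1, a_2, a_3\}$ denotes the diagonal matrix with diagonal elements
$a_1, a_2, a_3$ and $1_3$ denotes the $3 \times 3$ matrix with all elements equal
to one. Hence

$$S(\theta) = \frac{1}{n}\, \text{Diag}\{\theta_1, \theta_2, \theta_3\} - \frac{1}{n}\, \theta\theta'$$

and for $j = 1, 2, 3$

$$S_j(\theta) = \frac{1}{n}\, \text{Diag}\{\theta_1, \ldots, \theta_j\} - \frac{1}{n}\, \theta_{[j]}\theta_{[j]}'$$

with $\theta_{[j]} = (\theta_1, \ldots, \theta_j)'$,

$$H_j(\theta) = n\, \text{Diag}\{\theta_1^{-1}, \ldots, \theta_j^{-1}\} + n(1 - \theta_1 - \cdots - \theta_j)^{-1} 1_j$$

and

$$h_j(\theta) = n\theta_j^{-1}(1 - \theta_1 - \cdots - \theta_{j-1})(1 - \theta_1 - \cdots - \theta_j)^{-1}.$$

Note that $h_j(\theta)$ depends only on $\theta_1, \theta_2, \ldots, \theta_j$ so
that

$$\psi_j(\theta_1, \ldots, \theta_j) = h_j^{1/2}(\theta).$$

Here we need not restrict to compact rectangles as all the distributions
involved have finite mass. As suggested above for the general case, the
reference prior can now be obtained as

$$p(\theta_3|\theta_1, \theta_2) = \pi^{-1}\theta_3^{-1/2}(1 - \theta_1 - \theta_2 - \theta_3)^{-1/2}, \quad 0 < \theta_3 < 1 - \theta_1 - \theta_2,$$

$$p(\theta_2|\theta_1) = \pi^{-1}\theta_2^{-1/2}(1 - \theta_1 - \theta_2)^{-1/2}, \quad 0 < \theta_2 < 1 - \theta_1,$$

$$p(\theta_1) = \pi^{-1}\theta_1^{-1/2}(1 - \theta_1)^{-1/2}, \quad 0 < \theta_1 < 1,$$

i.e.,

$$p(\theta) = \pi^{-3}\theta_1^{-1/2}(1 - \theta_1)^{-1/2}\,\theta_2^{-1/2}(1 - \theta_1 - \theta_2)^{-1/2}\,\theta_3^{-1/2}(1 - \theta_1 - \theta_2 - \theta_3)^{-1/2},$$

$$\theta_i > 0,\ i = 1, 2, 3,\ \sum_1^3 \theta_i < 1.$$

As remarked by Berger and Bernardo (1992b), inferences about $\theta_1$ based
on the above prior depend only on $x_1$ and not on the frequencies of other
cells. This is not the case with standard noninformative priors such as
Jeffreys prior. See in this context Berger and Bernardo (1992b, Section 3.4).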
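The sequential form of the reference prior in Example 5.5 suggests a simple way
to simulate from it: each conditional is a Beta(1/2, 1/2) density rescaled to
the probability mass not yet allocated. The sketch below relies on that
observation, which follows from a change of variable in each conditional; the
seed and sample size are arbitrary.

```python
import random


def sample_reference_prior(rng):
    """One draw of (theta1, theta2, theta3) from the reference prior of
    Example 5.5, built from its sequential conditionals."""
    remaining = 1.0
    theta = []
    for _ in range(3):
        share = rng.betavariate(0.5, 0.5)   # Beta(1/2, 1/2) share of what is left
        theta.append(remaining * share)
        remaining -= theta[-1]
    return tuple(theta)


rng = random.Random(1)
draws = [sample_reference_prior(rng) for _ in range(50_000)]
means = [sum(d[k] for d in draws) / len(draws) for k in range(3)]
```

The prior means are 1/2, 1/4 and 1/8, so the prior is not symmetric in the
three cells: it reflects the ordering of the parameters by importance.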


5.1.11 Reference Priors Without Entropy Maximization 


Construction of reference priors involves two interesting ideas. The first is 
the new measure of information in a prior obtained by comparing it with the 
posterior. The second idea is the step by step algorithm based on arranging 
the parameters in ascending order of importance. The first throws light on 
why an objective prior would depend on the model for the likelihood. But 
it is the step by step algorithm that seems to help more in removing the 
problems associated with the Jeffreys prior. We explore below what kind of 
priors would emerge if we follow only part of the Berger-Bernardo algorithm 
(of Berger and Bernardo (1992a)). 

We illustrate with two (one-dimensional) parameters $\theta_1$ and $\theta_2$
of which $\theta_1$ is supposed to be more important. This would be interpreted
as meaning that the marginal prior for $\theta_1$ is more important than the
marginal prior for $\theta_2$. Then the prior is to be written as
$p(\theta_1)p(\theta_2|\theta_1)$, with

$$p(\theta_2|\theta_1) \propto \sqrt{I_{22}(\theta)}$$

and $p(\theta_1)$ is to be determined suitably.
Suppose we determine $p(\theta_1)$ from the probability matching conditions

$$P\{\theta_1 < \theta_{1,\alpha}(X)|X\} = 1 - \alpha + O_p(n^{-1}), \qquad (5.30)$$

$$\int P\{\theta_1 < \theta_{1,\alpha}(X)|\theta_1, \theta_2\}\, p(\theta_2|\theta_1)\, d\theta_2 = 1 - \alpha + O(n^{-1}). \qquad (5.31)$$

Here (5.30) defines the Bayesian quantile $\theta_{1,\alpha}$ of $\theta_1$, which
depends on data $X$, and (5.31) requires that the posterior probability on the
left-hand side of (5.30) matches the frequentist probability averaged out with
respect to $p(\theta_2|\theta_1)$. Under the assumption that $\theta_1$ and
$\theta_2$ are orthogonal, i.e., $I_{12}(\theta) = I_{21}(\theta) = 0$,
one gets (Ghosh (1994))

$$p_i(\theta_1) = \text{constant} \Big(\int_{K_{2i}} I_{11}^{-1/2}(\theta)\, p(\theta_2|\theta_1)\, d\theta_2\Big)^{-1} \qquad (5.32)$$

where $K_{1i} \times K_{2i}$ is a sequence of increasing bounded rectangles whose
union is $\Theta_1 \times \Theta_2$. This equation shows the marginal of
$\theta_1$ is a (normalized) harmonic mean of $\sqrt{I_{11}}$ which equals the
harmonic mean of $1/\sqrt{I^{11}}$.

What about a choice of $p_i(\theta_1)$ equal to the geometric or arithmetic mean?
The Berger-Bernardo reference prior is the geometric mean. The marginal
prior is the arithmetic mean if we follow the approach of taking weak limits
of suitable discrete uniforms, vide Ghosal et al. (1997). In many interesting
cases involving invariance, for example, in the case of location-scale families,
all three approaches lead to the right invariant Haar measures as the joint
prior.
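The three kinds of mean can be compared directly in the normal case. The sketch
below assumes $\theta_1 = \mu$ and $\theta_2 = \sigma$ for $N(\mu, \sigma^2)$,
so that $\sqrt{I_{11}} = 1/\sigma$ and the conditional prior is proportional to
$1/\sigma$; the interval for $\sigma$ and the midpoint rule are our choices.

```python
import math


def three_means(f, weight, k2, n=4000):
    """Harmonic, geometric and arithmetic means of f(theta2) > 0 over the
    interval k2 with respect to the density proportional to weight(theta2)."""
    lo, hi = k2
    h = (hi - lo) / n
    grid = [lo + (k + 0.5) * h for k in range(n)]
    w = [weight(t) for t in grid]
    total = sum(w)
    vals = [f(t) for t in grid]
    harmonic = total / sum(wi / v for wi, v in zip(w, vals))
    geometric = math.exp(sum(wi * math.log(v) for wi, v in zip(w, vals)) / total)
    arithmetic = sum(wi * v for wi, v in zip(w, vals)) / total
    return harmonic, geometric, arithmetic


# sqrt(I_11) = 1/sigma; conditional prior for sigma proportional to 1/sigma.
hm, gm, am = three_means(lambda s: 1.0 / s, lambda s: 1.0 / s, (0.5, 4.0))
```

The three means differ as numbers, but none of them depends on $\mu$, so each
gives a constant marginal for $\mu$ and hence a joint prior proportional to
$1/\sigma$, in line with the remark about location-scale families.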


5.1.12 Objective Priors with Partial Information 


Suppose we have chosen our favorite so-called noninformative prior, say $p_0$.
How can we utilize available prior information on a few moments of $\theta$? Let
$p$ be an arbitrary prior satisfying the following constraints based on
available information

$$\int_\Theta g_j(\theta) p(\theta)\, d\theta = A_j, \quad j = 1, 2, \ldots, k. \qquad (5.33)$$

If $g_j(\theta) = \theta_1^{j_1} \cdots \theta_p^{j_p}$, we have the usual moments
of $\theta$. We fix increasing compact rectangles $K_i$ with union equal to
$\Theta$ and among priors $p_i$ supported on $K_i$ and satisfying (5.33),
minimize the Kullback-Leibler number

$$K(p_i, p_0) = \int_{K_i} p_i(\theta) \log \frac{p_i(\theta)}{p_0(\theta)}\, d\theta.$$

The minimizing prior is

$$p_i^*(\theta) = \text{constant} \times \exp\Big\{\sum_{j=1}^{k} \lambda_j g_j(\theta)\Big\}\, p_0(\theta)$$

where $\lambda_j$'s are hyperparameters to be chosen so as to satisfy (5.33).
This can be proved by noting that for all priors $p_i$ satisfying (5.33),

$$K(p_i, p_0) = \int_{K_i} p_i(\theta) \log \frac{p_i(\theta)}{p_0(\theta)}\, d\theta = \text{constant} + \sum_{j=1}^{k} \lambda_j A_j + K(p_i, p_i^*)$$

which is minimized at $p_i = p_i^*$.
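A discretized version of this minimization is easy to carry out. The sketch
below assumes a single constraint on the mean, a uniform base prior on a grid
over [0, 1], and bisection for the hyperparameter; all three are illustrative
choices rather than part of the general result.

```python
import math


def tilted(p0, g, lam):
    """Normalized weights proportional to exp(lam * g(theta)) * p0(theta)."""
    logs = [lam * gv + math.log(p) for gv, p in zip(g, p0)]
    top = max(logs)
    w = [math.exp(l - top) for l in logs]
    total = sum(w)
    return [wi / total for wi in w]


def solve_lambda(p0, g, target, lo=-50.0, hi=50.0, iters=200):
    """Bisection on lam so that the tilted mean of g equals target.
    Relies on the tilted mean being increasing in lam."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean = sum(p * gv for p, gv in zip(tilted(p0, g, mid), g))
        if mean < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def kl(p, q):
    """Kullback-Leibler number sum p log(p / q) for discrete p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


grid = [(k + 0.5) / 200 for k in range(200)]   # cell midpoints in [0, 1]
p0 = [1.0 / len(grid)] * len(grid)             # discretized uniform base prior
lam = solve_lambda(p0, grid, target=0.3)
p_star = tilted(p0, grid, lam)
```

Any other distribution on the grid with mean 0.3 has a larger Kullback-Leibler
number relative to the base prior than `p_star`.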

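If instead of moments we know values of some quantiles for (a
one-dimensional) $\theta$ or more generally the prior probabilities $\alpha_j$ of
some disjoint subsets $B_j$ of $\Theta$, then it may be assumed
$\cup B_j = \Theta$ and one would use the prior given by

$$p^*(A) = \sum_j \alpha_j\, p_0(A \mid B_j),$$

where $p_0(A \mid B_j) = p_0(A \cap B_j)/p_0(B_j)$.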


Sun and Berger (1998) have shown how reference priors can be constructed 
when partial information is available.


5.2 Discussion of Objective Priors


This section is based on Ghosh and Samanta (2002b). We begin by listing 
some of the common criticisms of objective priors. We refer to them below as 
“noninformative priors”. 


1. Noninformative priors do not exist. How can one define them? 

2. Objective Bayesian analysis is ad hoc and hence no better than the ad 
hoc paradigms subjective Bayesian analysis tries to replace. 

3. One should try to use prior information rather than waste time trying to 
find noninformative priors. 

4. There are too many noninformative priors for a problem. Which one is to 
be used? 

5. Noninformative priors are typically improper. Improper priors do not
make sense as quantification of belief. For example, consider the uniform
distribution on the real line. Let $L$ be any large positive number. Then
$P\{-L < \theta < L\}/P\{\theta \notin (-L, L)\} = 0$ for all $L$ but for a
sufficiently large $L$, depending on the problem, we would be pretty sure that
$-L < \theta < L$.

6. If $\theta$ has uniform distribution because of lack of information, then this
should also be true for any smooth one-to-one function $\eta = g(\theta)$.

7. Why should a noninformative prior depend on the model of the data?

8. What are the impacts of 7 on coherence and the likelihood principle?


We make a couple of general comments first before replying to each of 
these criticisms. The purpose of introducing an objective prior is to produce a 
posterior that depends more on the data than the objective prior. One way of 
checking this would be to compare the posteriors for different objective priors 
as in Example 2.2 of Chapter 2. The objective prior is only the means for pro- 
ducing the posterior. Moreover, objective Bayesian analysis agrees that it is 
impossible to define a noninformative prior on an unbounded parameter space 
because maximum entropy need not be finite. This is the reason that increas- 
ing bounded sets were used in the construction. One thinks of the objective 
priors as consensus priors with low information — at least in those cases where 
no prior information is available. In all other cases, the choice of an objective 
prior should depend on available prior information (Subsection 5.1.12). We 
now turn to the criticisms individually. 

Points 1 and 2 are taken care of in the general comments. Point 3 is well 
taken, we do believe that elicitation of prior information is very important and 
any chosen prior should be consistent with what we know. A modest attempt 
towards this is made in Subsection 5.1.12. However, we feel it would rarely
be the case that a prior would be fully elicited; only a few salient points or
aspects with visible practical consequences can be ascertained. Subject to
this knowledge, the construction of the prior would still be along the lines
of Subsection 5.1.12 even though in general no explicit solution will exist.

As to point 4, we have already addressed this issue in the general
comments. Even though there is no unique objective prior, the posteriors will
usually be very similar even with a modest amount of data. Where this is not
the case, one would have to undertake a robustness analysis restricted to the 
class of chosen objective priors. This seems eminently doable. 

Even though usually objective priors are improper, we only work with 
them when the posterior is proper. Once again we urge the reader to go over 
the general comments. We would only add that many improper objective
priors lead to the same posteriors as coherent, proper, finitely additive priors.
This is somewhat technical, but the interested reader can consult Heath and 
Sudderth (1978, 1989). 

Point 6 is well taken care of by Jeffreys prior. Also in a given problem not 
all one-to-one transformations are allowed. For example, if the coordinates of
$\theta$ are in a decreasing order of importance, then we need only consider
$\eta = (\eta_1, \ldots, \eta_d)$ such that $\eta_j$ is a one-to-one continuously
differentiable function of $\theta_j$. There are invariance theorems for
reference and probability matching priors in such cases, Datta and Ghosh (1996).

We have discussed Point 7 earlier in the context of the entropy of Bernardo 
and Lindley. This measure depends on the experiment through the model of 
likelihood. Generally information in a prior cannot be defined except in the 
context of an experiment. Hence it is natural that a low-information prior 
will not be the same for all experiments. Because a model is a mathemati- 
cal description of an experiment, a low-information prior will depend on the 
model. 

We now turn to the last point. Coherence in the sense of Heath and Sud- 
derth (1978) is defined in the context of a model. Hence the fact that an 
objective prior depends on a model will not automatically lead to incoher- 
ence. However, care will be needed. As we have noted earlier, a right Haar 
prior for location-scale families ensures coherent inference but in general a left 
Haar prior will not. 

The impact on likelihood principle is more tricky. The likelihood principle 
in its strict sense is violated because the prior and hence the posterior depends 
on the experiment through the form of the likelihood function. However, for 
a fixed experiment, decision based on the posterior and the corresponding 
posterior risk depend only on the likelihood function. We pursue this a bit 
more below. 

Inference based on objective priors does violate the stopping rule prin- 
ciple, which is closely related to the likelihood principle. In particular, in 
Example 1.2 of Carlin and Louis (1996), originally suggested by Lindley and 
Phillips (1976), one would get different answers according to a binomial or a 
negative binomial model. This example is discussed in Chapter 6. 

To sum up we do seem to have good answers to most of the criticisms but 
have to live with some violations of the likelihood and stopping rule principles.


5.3 Exchangeability


A sequence of real valued random variables $\{X_i\}$ is exchangeable if for all
$n$, all distinct suffixes $\{i_1, i_2, \ldots, i_n\}$ and all
$B_1, B_2, \ldots, B_n \subset R$,

$$P\{X_{i_1} \in B_1, X_{i_2} \in B_2, \ldots, X_{i_n} \in B_n\} = P\{X_1 \in B_1, X_2 \in B_2, \ldots, X_n \in B_n\}.$$


In many cases, a statistician will be ready to assume exchangeability as a 
matter of subjective judgment. 

Consider now the special case where each X; assumes only the values 0 and 
1. A famous theorem of de Finetti then shows that the subjective judgment 
of exchangeability leads to both a model for the likelihood and a prior. 


Theorem 5.6. (de Finetti) If $X_i$'s are exchangeable and assume only
values 0 and 1, then there exists a distribution $\Pi$ on $(0, 1)$ such that the
joint distribution of $X_1, \ldots, X_n$ can be represented as

$$P(X_1 = x_1, \ldots, X_n = x_n) = \int_0^1 \prod_{i=1}^{n} \theta^{x_i}(1 - \theta)^{1 - x_i}\, d\Pi(\theta).$$

This means $X_i$'s can be thought of as i.i.d. $B(1, \theta)$ variables, given
$\theta$, where $\theta$ has the distribution $\Pi$. For a proof of this theorem
and other results of this kind, see, for example, Bernardo and Smith (1994,
Chapter 4).

The prior distribution $\Pi$ can be determined in principle from the joint
distribution of all the $X_i$'s, but one would not know the joint distribution of
all the $X_i$'s. If one wants to actually elicit $\Pi$, one could ask oneself
what is one's subjective predictive probability
$P\{X_{i+1} = 1|X_1, X_2, \ldots, X_i\}$. Suppose the subjective predictive
probability is $(\alpha + \sum_{j=1}^{i} X_j)/(\alpha + \beta + i)$ where
$\alpha > 0$, $\beta > 0$. Then the prior for $\theta$ is the Beta distribution
with hyperparameters $\alpha$ and $\beta$. Nonparametric elicitations of this
kind are considered in Fortini et al. (2000).
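The link between the predictive rule and exchangeability can be verified
exactly for short sequences. The sketch below multiplies the one-step
predictive probabilities and uses exact rational arithmetic; the integer
hyperparameters and the particular 0-1 sequence are arbitrary choices.

```python
from fractions import Fraction
from itertools import permutations


def joint_probability(xs, alpha, beta):
    """P(X_1 = x_1, ..., X_n = x_n) built from the predictive rule
    P(X_{i+1} = 1 | X_1, ..., X_i) = (alpha + sum X_j) / (alpha + beta + i)."""
    prob = Fraction(1)
    successes = 0
    for i, x in enumerate(xs):
        p_one = Fraction(alpha + successes, alpha + beta + i)
        prob *= p_one if x == 1 else 1 - p_one
        successes += x
    return prob


alpha, beta = 2, 3
seq = (1, 1, 0, 1, 0)
# Joint probability of every distinct rearrangement of the sequence.
probs = {perm: joint_probability(perm, alpha, beta)
         for perm in set(permutations(seq))}
```

Every rearrangement receives the same probability, 2/105, which is also the
value of the integral in Theorem 5.6 when the mixing distribution is Beta(2, 3).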


5.4 Elicitation of Hyperparameters for Prior 


Although a full prior is not easy to elicit, one may be able to elicit hyperpa- 
rameters in an assumed model for a prior. We discuss this problem somewhat 
informally in the context of two examples, a univariate normal and a bivariate 
normal likelihood. 

Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ and we assume a
normal prior for $\mu$ given $\sigma^2$ and an inverse gamma prior for
$\sigma^2$. How do we choose the hyperparameters? We think of a scenario where a
statistician is helping a subject matter expert to articulate his judgment.

Let $p(\mu|\sigma^2)$ be normal with mean $\eta$ and variance $c^2\sigma^2$ where
$c$ is a constant. The hyperparameter $\eta$ is a prior guess for the mean of
$X$. The statistician has to make it clear that what is involved is not a guess
about $\mu$ but a guess about the mean of $\mu$. So the expert has to think of
the mean $\mu$ itself as uncertain and subject to change.

Assuming that the expert can come up with a number for $\eta$, one may try
to elicit a value for $c$ in two different ways. If the expert can assign a
range of variation for $\mu$ (given $\sigma$) and is pretty sure $\mu$ will be in
this range, one may equate this to $\eta \pm 3c\sigma$; $\eta$ would be at the
center of the range and distance of upper or lower limit from the center would
be $3c\sigma$. To check consistency, one would like to elicit the range for
several values of $\sigma$ and see if one gets nearly the same $c$. In the
second method for eliciting $c$, one notes that $c^2$ determines the amount of
shrinking of $\bar X$ to the prior guess $\eta$ in the posterior mean

$$E(\mu|X) = \frac{\frac{n}{\sigma^2}\bar X + \frac{1}{c^2\sigma^2}\eta}{\frac{n}{\sigma^2} + \frac{1}{c^2\sigma^2}} = \frac{n\bar X + \eta/c^2}{n + 1/c^2}$$

(vide Example 2.1 of Section 2.2).
Thus if $c^2 = 1/n$, $\bar X$ and $\eta$ have equal weights. If $c^2 = 5/n$, the
weight of $\eta$ diminishes to one fifth of the weight for $\bar X$. In most
problems one would not have stronger prior belief.
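The effect of $c^2$ on the shrinkage is easy to tabulate. In the sketch below
the function names and the numerical values of the sample mean, prior guess and
sample size are hypothetical.

```python
def posterior_mean_mu(xbar, eta, n, c2):
    """Posterior mean of mu when the prior variance is c2 * sigma^2.
    sigma^2 cancels, leaving weights n on xbar and 1/c2 on eta."""
    return (n * xbar + eta / c2) / (n + 1.0 / c2)


def weight_ratio(n, c2):
    """Weight of the prior guess eta divided by the weight of xbar."""
    return (1.0 / c2) / n


n = 20
equal = posterior_mean_mu(xbar=10.0, eta=0.0, n=n, c2=1.0 / n)
weak = posterior_mean_mu(xbar=10.0, eta=0.0, n=n, c2=5.0 / n)
```

With `c2 = 1/n` the posterior mean is halfway between the prior guess and the
sample mean; with `c2 = 5/n` the prior guess carries one fifth of the weight
of the sample mean.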
We now discuss elicitation of hyperparameters for the inverse Gamma prior
for $\sigma^2$ given by

$$p(\sigma^2) = \frac{1}{\Gamma(\alpha)\beta^\alpha}\,\frac{1}{(\sigma^2)^{\alpha+1}}\, e^{-1/(\beta\sigma^2)}.$$

The prior guess for $\sigma^2$ is $[\beta(\alpha - 1)]^{-1}$. This is likely to
be more difficult to elicit than the prior guess $\eta$ about $\mu$. The shape
parameter $\alpha$ can be elicited by deciding how much to shrink the Bayes
estimate of $\sigma^2$ towards the prior guess $[\beta(\alpha - 1)]^{-1}$. Note
that the Bayes estimate has the representation

$$E(\sigma^2|X) = \frac{\alpha - 1}{\alpha - 1 + n/2}\,\frac{1}{\beta(\alpha - 1)} + \frac{(n - 1)/2}{\alpha - 1 + n/2}\, s^2 + \frac{n(\bar X - \eta)^2}{(2\alpha + n - 2)(1 + nc^2)}$$

where $(n - 1)s^2 = \sum (X_i - \bar X)^2$. In order to avoid dependence on
$\bar X$ one may want to do the elicitation based on

$$E(\sigma^2|s^2) = \frac{\alpha - 1}{\alpha - 1 + (n - 1)/2}\,\frac{1}{\beta(\alpha - 1)} + \frac{(n - 1)/2}{\alpha - 1 + (n - 1)/2}\, s^2.$$

The elicitation of prior for $\mu$ and $\sigma$, especially the means and
variances of priors, may be based on examining related similar past data.

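We turn now to i.i.d. bivariate normal data $(X_i, Y_i)$, $i = 1, 2, \ldots, n$. There
are now five parameters, $(\mu_X, \sigma_X^2)$, $(\mu_Y, \sigma_Y^2)$ and the correlation coefficient $\rho$.
Also $E(Y|X = x) = \beta_0 + \beta_1 x$, $\mathrm{Var}(Y|X = x) = \sigma^2$, where $\sigma^2 = \sigma_Y^2(1 - \rho^2)$,
$\beta_1 = \rho\sigma_Y/\sigma_X$, $\beta_0 = \mu_Y - (\rho\sigma_Y/\sigma_X)\mu_X$.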

One could reparameterize in various ways. We adopt a parameterization
that is appropriate when prediction of $Y$ given $X$ is a primary concern. We
consider $(\mu_X, \sigma_X^2)$ as parameters for the marginal distribution of $X$, which may
be handled as in the univariate case. We then consider the three parameters
$(\sigma^2, \beta_0, \beta_1)$ of the conditional distribution of $Y$ given $X = x$.

The joint density may be written as a product of the marginal density of
$X$, namely $N(\mu_X, \sigma_X^2)$ and the conditional density of $Y$ given $X = x$, namely
$N(\beta_0 + \beta_1 x, \sigma^2)$. The full likelihood is

$$\prod_{i=1}^{n}\left[\frac{1}{\sqrt{2\pi}\,\sigma_X}\exp\left\{-\frac{(x_i-\mu_X)^2}{2\sigma_X^2}\right\}\right]\;\prod_{i=1}^{n}\left[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(y_i-\beta_0-\beta_1 x_i)^2}{2\sigma^2}\right\}\right].$$


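It is convenient to rewrite $\beta_0 + \beta_1 x_i$ as $\gamma_0 + \gamma_1(x_i - \bar{x})$, with $\gamma_1 = \beta_1$, $\gamma_0 =
\beta_0 + \gamma_1\bar{x}$. Suppose we think of $x_i$'s to be fixed. We concentrate on the second
factor to find its conjugate prior. The conditional likelihood given the $x_i$'s is

$$\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n}\exp\left\{-\frac{1}{2\sigma^2}\left[\sum (y_i-\hat\gamma_0-\hat\gamma_1(x_i-\bar{x}))^2 + n(\gamma_0-\hat\gamma_0)^2 + S_{xx}(\gamma_1-\hat\gamma_1)^2\right]\right\},$$

where $\hat\gamma_0 = \bar{y}$, $\hat\gamma_1 = \sum (y_i - \bar{y})(x_i - \bar{x})/S_{xx}$, and $S_{xx} = \sum (x_i - \bar{x})^2$.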

Clearly, the conjugate prior for $\sigma^2$ is an inverse Gamma and the conjugate
priors for $\gamma_0$ and $\gamma_1$ are independent normals whose parameters may be elicited
along the same lines as those for the univariate normal except that more care
is needed. The statistician could fix several values of $x$ and invite the expert
to guess the corresponding values of $y$. A straight line through the scatter plot
will yield a prior guess on the linear relation between $x$ and $y$. The slope of
this line may be taken as the prior mean for $\beta_1$ and the intercept as the prior
mean for $\beta_0$. These would provide prior means for $\gamma_0$, $\gamma_1$ (for the given values
of $x_i$'s in the present data). The prior variances can be determined by fixing
how much shrinkage towards a prior mean is desired.

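Suppose that the prior distribution of $\sigma^2$ has the density

$$p(\sigma^2) = \frac{1}{\Gamma(a)b^{a}}\,\frac{1}{(\sigma^2)^{a+1}}\,e^{-1/(b\sigma^2)}.$$

Given $\sigma^2$, the prior distributions for $\gamma_0$ and $\gamma_1$ are taken to be independent
normals $N(\mu_0, c_0^2\sigma^2)$ and $N(\mu_1, c_1^2\sigma^2)$ respectively.

The marginal posterior distributions of $\gamma_0$ and $\gamma_1$ with these priors are
Student's $t$ with posterior means given by

$$E(\gamma_0|x, y) = \frac{n}{n + c_0^{-2}}\,\hat\gamma_0 + \frac{c_0^{-2}}{n + c_0^{-2}}\,\mu_0 \qquad (5.34)$$

and

$$E(\gamma_1|x, y) = \frac{S_{xx}}{S_{xx} + c_1^{-2}}\,\hat\gamma_1 + \frac{c_1^{-2}}{S_{xx} + c_1^{-2}}\,\mu_1. \qquad (5.35)$$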


As indicated above, these expressions may be used to elicit the values of $c_0$
and $c_1$. For elicitation of the shape parameter $a$ of the prior distribution of
$\sigma^2$ we may use a similar representation of $E(\sigma^2|x, y)$. Note that the statistics
$S^2 = \sum (y_i - \hat\gamma_0 - \hat\gamma_1(x_i - \bar{x}))^2$, $\hat\gamma_0$ and $\hat\gamma_1$ are jointly sufficient for $(\sigma^2, \gamma_0, \gamma_1)$.
Table 5.1. Data on Water Flow (in 100 Cubic Feet per Second) at Two Points
(Libby and Newgate) in January During 1931-43 in Kootenai River (Ezekiel and
Fox, 1959)

Year         1931   32   33   34   35   36   37   38   39   40   41   42   43
Newgate (y)  19.7 18.0 26.1 44.9 26.1 19.9 15.7 27.6 24.9 23.4 23.1 31.3 23.8
Libby (x)    27.1 20.9 33.4 77.6 37.0 21.6 17.6 35.1 32.6 26.0 27.6 38.7 27.8


As in the univariate normal case considered above we do the elicitation based
on

$$E(\sigma^2|\hat\sigma^2) = \frac{a-1}{(n/2)+a-2}\,\frac{1}{b(a-1)} + \frac{(n/2)-1}{(n/2)+a-2}\,\hat\sigma^2, \qquad (5.36)$$

where $\hat\sigma^2 = S^2/(n - 2)$ is the classical estimate of $\sigma^2$. We illustrate with an
example.


Example 5.7. Consider the bivariate data of Table 5.1 (Ezekiel and Fox, 1959). 
This relates to water flow in Kootenai River at two points, Newgate (British 
Columbia, Canada) and Libby (Montana, USA). A dam was being planned 
on the river at Newgate, B.C., where it crossed the Canadian border. The 
question was how the flow at Newgate could be estimated from that at Libby. 

Consider the above setup for this set of bivariate data. Calculations yield
$\bar{x} = 32.5385$, $S_{xx} = 2693.1510$, $\hat\gamma_0 = 24.9615$, $\hat\gamma_1 = 0.4748$ and $\hat\sigma^2 = 3.186$ so
that the classical (least squares) regression line is

$$y = 24.9615 + 0.4748(x - \bar{x}),$$
i.e., $y = 9.5122 + 0.4748x$.


Suppose now that we have similar past data D for a number of years, say, the 
previous decade, for which the fitted regression line is given by 


$$y = 10.3254 + 0.4809x$$


with an estimate of error variance ($\sigma^2$) 3.9363. We don't present this set of
"historical data" here, but a scatter plot is shown in Figure 5.1. As suggested
above, we may take the prior means for $\beta_0$, $\beta_1$ and $\sigma^2$ to be 10.3254, 0.4809,
and 3.9363, respectively. We, however, take these to be 10.0, 0.5, and 4.0
as these are considered only as prior guesses. This gives $\mu_0 = 10 + 0.5\bar{x} =
26.2693$ and $\mu_1 = 0.5$. Given that the historical data set $D$ was obtained
in the immediate past (before 1931), we have considerable faith in our prior
guess, and as indicated above, we set the weights for the prior means $\mu_0$, $\mu_1$,
and $1/(b(a - 1))$ of $\gamma_0$, $\gamma_1$, and $\sigma^2$ in (5.34), (5.35), and (5.36) equal to 1/6 so
that the ratio of the weights for the prior estimate and the classical estimate
is 1 : 5 in each case. Thus we set

$$\frac{c_0^{-2}}{n + c_0^{-2}} = \frac{c_1^{-2}}{S_{xx} + c_1^{-2}} = \frac{a-1}{(n/2)+a-2} = \frac{1}{6},$$

which yields, with $n = 13$ and $S_{xx} = 2693.151$ for the current data, $c_0^{-2} = 2.6$,
$c_1^{-2} = 538.63$, and $a = 2.1$. If, however, the data set $D$ was older, we would
attach less weight (less than 1/6) to the prior means. Our prior guess for $\sigma^2$ is
$1/(b(a - 1))$. Equating this to 4.0 we get $b = 0.2273$. Now from (5.34)-(5.36)
we obtain the Bayes estimates of $\gamma_0$, $\gamma_1$ and $\sigma^2$ as 25.1795, 0.4790 and 3.3219
respectively. The estimated regression line is

$$y = 25.1795 + 0.4790(x - \bar{x}),$$
i.e., $y = 9.5936 + 0.4790x$.


The scatter plots for the current Kootenai data of Table 5.1 and the histor- 
ical data D as well as the classical and Bayesian regression line estimates 
derived above are shown in Figure 5.1. The symbols “o” for current and “=” 
for historical data are used here. The continuous line stands for the Bayesian 
regression line based on the current data and the prior, and the broken line 
represents the classical regression line based on the current data. The Bayesian 
line seems somewhat more representative of the whole data set than the clas- 
sical regression line, which is based on the current data only. If one fits a 
classical regression line to the whole data set, it will attach equal importance 
to the current and the historical data; it is a choice between all or nothing. 
The Bayesian method has the power to handle both current data and other 
available information in a flexible way. 


The 95% HPD credible intervals for $\gamma_0$ and $\gamma_1$ based on the posterior
$t$-distributions are respectively (21.4881, 28.8708) and (0.2225, 0.7355),
which are comparable with the classical 95% confidence intervals, (23.8719,
26.0511) for $\gamma_0$ and (0.3991, 0.5505) for $\gamma_1$. Note that, as expected, the
Bayesian intervals are more conservative than the classical ones, the Bayesian
providing for the possible additional variation in the parameters. If one uses
the objective prior $p(\gamma_0, \gamma_1, \sigma^2) \propto 1/\sigma^2$, the objective Bayes solutions would
agree with the classical estimates and confidence intervals.
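The point estimates of Example 5.7 can be retraced from Table 5.1. The following Python sketch computes the least squares quantities, solves for the hyperparameters implied by a common prior weight of 1/6, and evaluates (5.34)-(5.36). Variable names such as `c0_inv2` are illustrative choices, and small differences in the last printed digit are possible because the text works with rounded intermediate values.

```python
# Table 5.1: January flow (in 100 cubic feet per second), 1931-43.
x = [27.1, 20.9, 33.4, 77.6, 37.0, 21.6, 17.6, 35.1, 32.6, 26.0, 27.6, 38.7, 27.8]  # Libby
y = [19.7, 18.0, 26.1, 44.9, 26.1, 19.9, 15.7, 27.6, 24.9, 23.4, 23.1, 31.3, 23.8]  # Newgate

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# Classical (least squares) estimates in the centered parameterization.
g0_hat = ybar
g1_hat = Sxy / Sxx
S2 = sum((yi - g0_hat - g1_hat * (xi - xbar)) ** 2 for xi, yi in zip(x, y))
sigma2_hat = S2 / (n - 2)

# Prior guesses for (beta0, beta1, sigma^2) and the prior weight used in the example.
beta0_prior, beta1_prior, sigma2_prior = 10.0, 0.5, 4.0
w = 1.0 / 6.0
mu0 = beta0_prior + beta1_prior * xbar   # prior mean of gamma0
mu1 = beta1_prior                        # prior mean of gamma1

# Hyperparameters obtained by setting each prior weight equal to w.
c0_inv2 = w * n / (1.0 - w)
c1_inv2 = w * Sxx / (1.0 - w)
a = (1.0 + w * (n / 2.0 - 2.0)) / (1.0 - w)
b = 1.0 / (sigma2_prior * (a - 1.0))

# Bayes estimates from (5.34), (5.35) and (5.36).
g0_bayes = (n * g0_hat + c0_inv2 * mu0) / (n + c0_inv2)
g1_bayes = (Sxx * g1_hat + c1_inv2 * mu1) / (Sxx + c1_inv2)
denom = n / 2.0 + a - 2.0
sigma2_bayes = ((a - 1.0) / denom) / (b * (a - 1.0)) + ((n / 2.0 - 1.0) / denom) * sigma2_hat

print(f"classical: gamma0={g0_hat:.4f} gamma1={g1_hat:.4f} sigma2={sigma2_hat:.4f}")
print(f"hyperparameters: c0^-2={c0_inv2:.2f} c1^-2={c1_inv2:.2f} a={a:.2f} b={b:.4f}")
print(f"Bayes: gamma0={g0_bayes:.4f} gamma1={g1_bayes:.4f} sigma2={sigma2_bayes:.4f}")
print(f"Bayes line: y = {g0_bayes - g1_bayes * xbar:.4f} + {g1_bayes:.4f} x")
```

Changing `w` shows how the fitted line moves between the classical line (`w` near 0) and the prior guess (`w` near 1).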

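All of the above would be inapplicable if $x$ and $y$ have the same footing and
the object is estimation of the parameters in the model rather than prediction
of unobserved $y$'s for given $x$'s. In this case, one would write the bivariate
normal likelihood

$$\prod_{i=1}^{n}\frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_i-\mu_X)^2}{\sigma_X^2} + \frac{(y_i-\mu_Y)^2}{\sigma_Y^2} - \frac{2\rho(x_i-\mu_X)(y_i-\mu_Y)}{\sigma_X\sigma_Y}\right]\right\}.$$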


The conjugate prior is a product of a bivariate normal and an inverse-Wishart
distribution. Notice that we have discussed elicitation of hyperparameters for
a conjugate prior for several parameters. Instead of substituting these elicited
values in the conjugate prior, we could treat the hyperparameters as having
a prior distribution over a set of values around the elicited numbers. The
prior distribution for the hyperparameters could be a uniform distribution on
the set of values around the elicited numbers. This would be a hierarchical
prior. An alternative would be to use several conjugate priors with different
hyperparameter values from the set around the elicited numbers and check
for robustness.

[Figure 5.1: scatter plot with legend entries "Bayesian regression line",
"Classical regression line", "Current data", and "Historical data".]

Fig. 5.1. Scatter plots and regression lines for the Kootenai River data.

Elicitation of hyperparameters of a conjugate prior for a linear model is 
treated in Kadane et al. (1980), Garthwaite and Dickey (1988, 1992), etc. 
A recent review is Garthwaite et al. (2005). Garthwaite and Dickey (1988)
observe that the prior variance-covariance matrix of the regression coefficients,
especially its off-diagonal elements, is the most difficult to elicit.
We have assumed the off-diagonal elements are zero, a common simplifying
assumption, and determined the diagonal elements by eliciting how much
shrinkage towards the prior is sought in the Bayes estimates of the means of
the regression coefficients. Garthwaite and Dickey (1988) indicate an indirect
way of eliciting the variance-covariance matrix.


5.5 A New Objective Bayes Methodology Using 
Correlation 


As we have already seen, there are many approaches for deriving objective, 
reference priors and also for conducting default Bayesian analysis. One such 
approach that relies on some new developments is discussed here. Using the 
Pearson correlation coefficient in a rather different way, DasGupta et al. (2000) 
and Delampady et al. (2001) show that some of its properties can lead to 
substantial developments in Bayesian statistics. 

Suppose $X$ is distributed with density $f(x|\theta)$ and $\theta$ has a prior $\pi$. Let the
joint probability distribution of $X$ and $\theta$ under the prior $\pi$ be denoted by $P$.
We can then consider the Pearson correlation coefficient $\rho_P$ between two
functions $g_1(X, \theta)$ and $g_2(X, \theta)$ under this probability distribution $P$. An objective
prior in the spirit of reference prior can then be derived by maximizing the
correlation between two post-data summaries about the parameter $\theta$, namely the
posterior density and the likelihood function. Given a class $\Gamma$ of priors,
Delampady et al. (2001) show that the prior $\pi$ that maximizes $\rho_P\{f(x|\theta), \pi(\theta|x)\}$
is the one with the least Fisher information $I(\pi) = E^{\pi}\{\frac{d}{d\theta}\log\pi(\theta)\}^2$ in the
class $\Gamma$. Actually, Delampady et al. (2001) note that it is very difficult to
work with the exact correlation coefficient $\rho_P\{f(x|\theta), \pi(\theta|x)\}$ and hence they
maximize an appropriate large sample approximation by assuming that the
likelihood function and the prior density are sufficiently smooth. The following
example is from Delampady et al. (2001).





Example 5.8. Consider a location parameter $\theta$ with $|\theta| \le 1$. Assume that
$f$ and $\pi$ are sufficiently smooth. Then the prior density which achieves the
minimum Fisher information in the class of priors compactly supported on
$[-1, 1]$ is what is desired. Bickel (1981) and Huber (1974) show that this prior
is

$$\pi(\theta) = \begin{cases}\cos^2(\pi\theta/2) & \text{if } |\theta| \le 1;\\ 0 & \text{otherwise.}\end{cases}$$


Thus, the Bickel prior is the default prior under the correlation criterion. The 
variational result that obtains this prior as the one achieving the minimum 
Fisher information was rediscovered by Ghosh and Bhattacharya in 1983 (see 
Ghosh (1994); a proof of this interesting result is also given there). 
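The minimizing property in Example 5.8 can be checked numerically for particular competitors. The sketch below approximates $I(\pi) = \int \{\frac{d}{d\theta}\log\pi(\theta)\}^2\pi(\theta)\,d\theta$ by a midpoint rule for the Bickel prior and for one other smooth density on $[-1, 1]$. The comparison density $(15/16)(1-\theta^2)^2$ and all function names are choices made here for illustration; a single comparison illustrates the variational result but does not prove it.

```python
import math


def fisher_information(density, score, lo=-1.0, hi=1.0, m=20_000):
    """Midpoint-rule approximation of the integral of score(t)^2 * density(t)."""
    h = (hi - lo) / m
    total = 0.0
    for k in range(m):
        t = lo + (k + 0.5) * h
        total += score(t) ** 2 * density(t)
    return total * h


def bickel_density(t):
    return math.cos(math.pi * t / 2.0) ** 2


def bickel_score(t):
    # d/dt log cos^2(pi t / 2) = -pi tan(pi t / 2)
    return -math.pi * math.tan(math.pi * t / 2.0)


def quartic_density(t):
    # Illustrative competitor: (15/16)(1 - t^2)^2 integrates to 1 on [-1, 1].
    return (15.0 / 16.0) * (1.0 - t * t) ** 2


def quartic_score(t):
    # d/dt log (1 - t^2)^2 = -4 t / (1 - t^2)
    return -4.0 * t / (1.0 - t * t)


I_bickel = fisher_information(bickel_density, bickel_score)
I_quartic = fisher_information(quartic_density, quartic_score)
print(I_bickel, math.pi ** 2)   # the integrand reduces to pi^2 sin^2(pi t / 2)
print(I_quartic)                # the integrand reduces to 15 t^2
```

The Bickel prior gives $\pi^2 \approx 9.87$, slightly below the value 10 obtained for the quartic competitor.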


Table 5.2. Values of $\delta_\pi(X)$ and $\delta_\nu(X)$ for Various $X = x$
x    …   1.5   2   3   5   8   10   15

In addition to the above problem, it is also possible to identify a robust
estimate of $\theta$ using this approach. Suppose $\pi$ is a reference prior, and $\Gamma$ is
a class of plausible priors. $\pi$ may or may not belong to $\Gamma$. Let $\delta_\nu$ be the
Bayes estimate of $\theta$ with respect to prior $\nu \in \Gamma$. To choose an optimal Bayes
estimate, maximize (over $\nu \in \Gamma$) the correlation coefficient $\rho_P(\theta, \delta_\nu)$ between
$\theta$ and $\delta_\nu$. Note that $\rho_P$ is calculated under $P$, and in this joint distribution
$\theta$ follows the reference prior $\pi$. Again, by maximizing an appropriate large
sample approximation $\hat\rho_P(\theta, \delta_\nu)$ (assuming that the likelihood function and
the prior density are sufficiently smooth), Delampady et al. (2001) obtain
the following theorem.


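Theorem 5.9. The estimate $\delta_\nu(X)$ maximizing $\hat\rho_P(\theta, \delta_\nu)$ is Bayes with
respect to the prior density

$$\nu(\theta) = c\,\pi(\theta)\exp\left\{-\frac{1}{2\tau^2}(\theta-\mu)^2\right\}, \qquad (5.37)$$

where $\mu$, $\tau^2$ are arbitrary and $c$ is a normalizing constant.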


The interesting aspect of this prior $\nu$ is evident from the fact that
it is simply a product of the initial reference prior $\pi$ and a Gaussian factor.
This may be interpreted as $\nu$ being the posterior density when one begins
with a flat prior $\pi$ and it is revised with an observation from the Gaussian
distribution to pull in its tails. Consider the following example again from
Delampady et al. (2001).


Example 5.10. Consider the reference prior density $\pi(\theta) \propto (1 + \theta^2/3)^{-2}$,
density of the Student's $t_3$ prior, a flat-tailed prior. Suppose that the family $\Gamma$
contains only symmetric priors and so $\nu(\theta)$ is of the form $\nu(\theta) = c(1 +
\theta^2/3)^{-2}\exp\{-\theta^2/(2\tau^2)\}$. Let $X \sim \mathrm{Cauchy}(\theta, \sigma)$, with known $\sigma$ and having
density

$$f(x|\theta) = \frac{1}{\pi\sigma}\left[1 + \frac{(x-\theta)^2}{\sigma^2}\right]^{-1}.$$

Some selected values are reported in Table 5.2 for $\sigma = 0.2$. For small and
moderate values of $x$, $\delta_\pi$ and $\delta_\nu$ behave similarly, whereas for large values, $\delta_\nu$
results in much more shrinkage than $\delta_\pi$. This is only expected because the
penultimate $\nu$ has normal tails, whereas the reference $\pi$ has flat tails.
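The qualitative behaviour described in Example 5.10 can be explored with a small numerical sketch that computes the two posterior means by a midpoint rule. The value of $\tau$ is not stated in the text, so `TAU = 1.0` below is an assumption made purely for illustration, as are the integration range and step; the printed numbers are therefore not claimed to reproduce Table 5.2.

```python
import math

SIGMA = 0.2   # Cauchy scale, as in Example 5.10
TAU = 1.0     # ASSUMPTION: tau is not given in the text; 1.0 is a hypothetical value


def t3_kernel(theta):
    """Unnormalized Student t_3 reference prior pi."""
    return (1.0 + theta * theta / 3.0) ** -2


def nu_kernel(theta):
    """Unnormalized prior nu: pi times a Gaussian factor centered at 0."""
    return t3_kernel(theta) * math.exp(-theta * theta / (2.0 * TAU * TAU))


def cauchy_likelihood(x, theta):
    z = (x - theta) / SIGMA
    return 1.0 / (math.pi * SIGMA * (1.0 + z * z))


def posterior_mean(x, prior_kernel, lo=-100.0, hi=100.0, h=0.005):
    """Posterior mean of theta given x, by a midpoint rule on [lo, hi]."""
    m = int(round((hi - lo) / h))
    num = 0.0
    den = 0.0
    for k in range(m):
        theta = lo + (k + 0.5) * h
        weight = prior_kernel(theta) * cauchy_likelihood(x, theta)
        num += theta * weight
        den += weight
    return num / den


for x in (0.5, 2.0, 5.0, 15.0):
    d_pi = posterior_mean(x, t3_kernel)
    d_nu = posterior_mean(x, nu_kernel)
    print(f"x = {x:5.1f}   delta_pi = {d_pi:.3f}   delta_nu = {d_nu:.3f}")
```

With this choice of $\tau$, the two estimates are close for small $x$, while for large $x$ the estimate under $\nu$ is pulled much closer to zero.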





5.6 Exercises 


1. Let $X \sim B(n, p)$. Choose a prior on $p$ such that the marginal distribution
   of $X$ is uniform on $\{0, 1, \ldots, n\}$.

2. (Schervish (1995, p.121)). Let $h$ be a function of $(x, \theta)$ that is differentiable
   in $\theta$. Define a prior
   $$p^*(\theta) = [\mathrm{Var}_\theta((\partial/\partial\theta)h(X, \theta))]^{1/2}.$$
   (a) Show that $p^*(\theta)$ satisfies the invariance condition (5.1).
   (b) Choose a suitable $h$ such that $p^*(\theta)$ is the Jeffreys prior.

3. Prove (5.16) and generalize it to the multiparameter case.

4. (Lehmann and Casella (1998)) For a scale family, show that there is an
   equivariant estimate of $\sigma^k$ that minimizes $E(T - \sigma^k)^2/\sigma^{2k}$. Display the
   estimate as the ratio of two integrals and interpret as a Bayes estimate.

5. Consider a multinomial distribution.
   (a) Show that the Dirichlet distribution is a conjugate prior.
   (b) Identify the precision parameter for the Dirichlet prior distribution.
   (c) Let the precision parameter go to zero and identify the limiting prior
   and posterior. Suggest why the limiting posterior, but not the limiting
   prior, is used in objective Bayesian analysis.

6. Find the Jeffreys prior for the multinomial model.

7. Find the Jeffreys prior for the multivariate normal model with unknown
   mean vector and dispersion matrix.

8. (a) Examine why the Jeffreys prior may not be appropriate if the parameter
   is not identifiable over the full parameter space.
   (b) Show with an example that the Jeffreys prior may not have a proper
   posterior. (Hint. Try the following mixture: $X = 0$ with probability 1/2
   and is $N(\mu, 1)$ with probability 1/2.)
   (c) Suggest a heuristic reason as to why the posterior is often proper if we
   use a Jeffreys prior.

9. Bernardo has recently proposed the use of $\min\{K(f_0, f_1), K(f_1, f_0)\}$, instead
   of $K(f_0, f_1)$, as the criterion to maximize at each stage of reference
   prior. Examine the consequence of this change for the examples of reference
   priors discussed in this chapter.

10. Given $(\mu, \sigma^2)$, let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ and consider the prior
    $\pi(\mu, \sigma^2) \propto 1/\sigma^2$. Verify that
    $$P\{\bar{X} - t_{\alpha/2,n-1}s/\sqrt{n} \le \mu \le \bar{X} + t_{\alpha/2,n-1}s/\sqrt{n} \mid \mu, \sigma^2\}$$
    $$= P\{\bar{X} - t_{\alpha/2,n-1}s/\sqrt{n} \le \mu \le \bar{X} + t_{\alpha/2,n-1}s/\sqrt{n} \mid X_1, \ldots, X_n\}$$
    $$= 1 - \alpha,$$
    where $\bar{X}$ is the sample mean, $s^2 = \sum (X_i - \bar{X})^2/(n - 1)$, and $t_{\alpha/2,n-1}$ is the
    upper $\alpha/2$ point of $t_{n-1}$, $0 < \alpha < 1$.

11. Given $0 < \theta < 1$, let $X_1, \ldots, X_n$ be i.i.d. $B(1, \theta)$. Consider the Jeffreys
    prior for $\theta$. Find by simulation the frequentist coverage of $\theta$ by the
    two-tailed 95% credible interval for $\theta = \ldots$. Do the same for the usual
    frequentist interval $\hat\theta \pm z_{0.025}\sqrt{\hat\theta(1 - \hat\theta)/n}$ where $\hat\theta = \sum X_i/n$.

12. Derive (5.32) from an appropriate probability matching equation.

6 


Hypothesis Testing and Model Selection 


For Bayesians, model selection and model criticism are extremely important 
inference problems. Sometimes these tend to become much more complicated 
than estimation problems. In this chapter, some of these issues will be dis- 
cussed in detail. However, all models and hypotheses considered here are low- 
dimensional because high-dimensional models need a different approach. The 
Bayesian solutions will be compared and contrasted with the corresponding 
procedures of classical statistics whenever appropriate. Some of the discussion 
in this chapter is technical and it will not be used in the rest of the book. Those 
sections that are very technical (or otherwise can be omitted at first reading) 
are indicated appropriately. These include Sections 6.3.4, 6.4, 6.5, and 6.7. 
In Sections 6.2 and 6.3, we compare frequentist and Bayesian approaches to 
hypothesis testing. We do the same in an asymptotic framework in Section 
6.4. Recently developed methodologies such as the Bayesian P-value and some 
non-subjective Bayes factors are discussed in Sections 6.5 and 6.7. 


6.1 Preliminaries 


First, let us recall some notation from Chapter 2 and also let us introduce 
some specific notation for the discussion that follows. 

Suppose $X$ having density $f(x|\theta)$ is observed, with $\theta$ being an unknown
element of the parameter space $\Theta$. Suppose that we are interested in comparing
two models $M_0$ and $M_1$, which are given by

$M_0$ : $X$ has density $f(x|\theta)$ where $\theta \in \Theta_0$;
$M_1$ : $X$ has density $f(x|\theta)$ where $\theta \in \Theta_1$. (6.1)

For $i = 0, 1$ let $g_i(\theta)$ be the prior density of $\theta$, conditional on $M_i$ being the
true model. Then, to compare models $M_0$ and $M_1$ on the basis of a random
sample $x = (x_1, \ldots, x_n)$ one would use the Bayes factor

$$B_{01}(x) = \frac{m_0(x)}{m_1(x)}, \qquad (6.2)$$

where

$$m_i(x) = \int_{\Theta_i} f(x|\theta)g_i(\theta)\,d\theta, \quad i = 0, 1. \qquad (6.3)$$


We also use the notation $BF_{01}$ for the Bayes factor. Recall from Chapter 2
that the Bayes factor is the ratio of posterior odds ratio of the hypotheses to
the corresponding prior odds ratio. Therefore, if the prior probabilities of the
hypotheses, $\pi_0 = P^{\pi}(M_0) = P^{\pi}(\Theta_0)$ and $\pi_1 = P^{\pi}(M_1) = P^{\pi}(\Theta_1) = 1 - \pi_0$
are specified, then as in (2.17),

$$P(M_0|x) = \left\{1 + \frac{\pi_1}{\pi_0}\,B_{01}^{-1}(x)\right\}^{-1}. \qquad (6.4)$$

Thus, if conditional prior densities $g_0$ and $g_1$ can be specified, one should
simply use the Bayes factor $B_{01}$ for model selection. If, further, $\pi_0$ is also specified,
the posterior odds ratio of $M_0$ to $M_1$ can also be utilized. However, these
computations may not always be easy to perform, even when the required prior
ingredients are fully specified. A possible solution is the use of BIC as an
approximation to a Bayes factor. We study this in Subsection 6.1.1. The
situation can get much worse when the task of specifying these prior inputs itself
becomes a difficult problem as in the following problem.


Example 6.1. Consider the problem that is usually called nonparametric
regression. Independent responses $y_i$ are observed along with covariates $x_i$,
$i = 1, \ldots, n$. The model of interest is

$$y_i = g(x_i) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (6.5)$$

where $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$ errors with unknown error variance $\sigma^2$. The
function $g$ is called the regression function. In linear regression, $g$ is a priori
assumed to be linear in a set of finite regression coefficients. In general, $g$ can be
assumed to be fully unknown also. Now, if model selection involves choosing
$g$ from two different fully nonparametric classes of regression functions, this
becomes a very difficult problem. Computation of Bayes factor or posterior
odds ratio is then a formidable task. Various simplifications including
reducing $g$ to be semi-parametric have been studied. In such cases, some of these
problems can be handled.


Consider a different model checking problem now, that of testing for
normality. This is a very common problem encountered in frequentist inference,
because much of the inferential methodology is based on the normality
assumption. Simple or multiple linear regression, ANOVA, and many other
techniques routinely use this assumption. In its simplest form, the problem can
be stated as checking whether a given random sample $X_1, X_2, \ldots, X_n$ arose
from a population having the normal distribution. In the setup given above
in (6.1), we may write it as

$M_0$ : $X$ is $N(\mu, \sigma^2)$ with arbitrary $\mu$ and $\sigma^2 > 0$;
$M_1$ : $X$ does not have the normal distribution. (6.6)

However, this looks quite different from (6.1) above, because $M_1$ does not
constitute a parametric alternative. Hence it is not clear how to use Bayes
factors or posterior odds ratios here for model checking. The difficulty with
this model checking problem is clear: one is only interested in $M_0$ and not in
$M_1$.

This problem is addressed in Section 6.3 of Gelman et al. (1995). See 
also Section 9.9. We use the posterior predictive distribution of replicated 
future data to assess whether the predictions show systematic differences. 
In practice, replicated data will not be available, so cross-validation of some 
form has to be used, as discussed in Section 9.9. Gelman et al. (1995) have 
not used cross-validation and their P-values have come in for some criticism 
(see Section 6.5). 

The object of model checking is not to decide whether the model is true or 
false but to check whether the model provides a plausible approximation to the
data. It is clear that we have to use posterior predictive values and Bayesian 
P-values of some sort, but consensus on details does not seem to have emerged 
yet. It remains an important problem. 


6.1.1 BIC Revisited 


Under appropriate regularity conditions on $f$, $g_0$, and $g_1$, the Bayes factor
given in (6.2) can be approximated using the Laplace approximation or the
saddle point approximation. Let us change notation and express (6.3) as
follows:

$$m_i(x) = \int f(x|\theta_i)g_i(\theta_i)\,d\theta_i, \quad i = 0, 1, \qquad (6.7)$$

where $\theta_i$ is the $p_i$-dimensional vector of parameters under $M_i$, assumed to be
independent of $n$ (the dimension of the observation vector $x$). Let $\tilde\theta_i$ be the
posterior mode of $\theta_i$, $i = 0, 1$. Assume $\tilde\theta_i$ is an interior point of $\Theta_i$. Then,
expanding the logarithm of the integrand in (6.7) around $\tilde\theta_i$ using a
second-order Taylor series approximation, we obtain

$$\log\left(f(x|\theta_i)g_i(\theta_i)\right) \approx \log\left(f(x|\tilde\theta_i)g_i(\tilde\theta_i)\right) - \frac{1}{2}(\theta_i - \tilde\theta_i)' H_{\tilde\theta_i}(\theta_i - \tilde\theta_i),$$

where $H_{\tilde\theta_i}$ is the corresponding negative Hessian. Applying this approximation
to (6.7) yields,

$$m_i(x) \approx f(x|\tilde\theta_i)g_i(\tilde\theta_i)\int \exp\left\{-\frac{1}{2}(\theta_i - \tilde\theta_i)' H_{\tilde\theta_i}(\theta_i - \tilde\theta_i)\right\}d\theta_i
= f(x|\tilde\theta_i)g_i(\tilde\theta_i)(2\pi)^{p_i/2}|H_{\tilde\theta_i}^{-1}|^{1/2}. \qquad (6.8)$$


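$2\log B_{01}$ is a commonly used evidential measure to compare the support
provided by the data $x$ for $M_0$ relative to $M_1$. Under the above approximation
we have,

$$2\log(B_{01}) \approx 2\log\left(\frac{f(x|\tilde\theta_0)}{f(x|\tilde\theta_1)}\right) + 2\log\left(\frac{g_0(\tilde\theta_0)}{g_1(\tilde\theta_1)}\right) + (p_0 - p_1)\log(2\pi) + \log\left(\frac{|H_{\tilde\theta_0}^{-1}|}{|H_{\tilde\theta_1}^{-1}|}\right).$$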


A variation of this approximation is also commonly used, where instead of the
posterior mode $\tilde\theta_i$, the maximum likelihood estimate $\hat\theta_i$ is employed. Then,
instead of (6.8), one obtains

$$m_i(x) \approx f(x|\hat\theta_i)g_i(\hat\theta_i)(2\pi)^{p_i/2}|H_{\hat\theta_i}^{-1}|^{1/2}. \qquad (6.9)$$


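Here $H_{\hat\theta_i}$ is the observed Fisher information matrix evaluated at the maximum
likelihood estimator. If the observations are i.i.d. we have that $H_{\hat\theta_i} = nH_{1,\hat\theta_i}$,
where $H_{1,\hat\theta_i}$ is the observed Fisher information matrix obtained from a single
observation. In this case,

$$m_i(x) \approx f(x|\hat\theta_i)g_i(\hat\theta_i)(2\pi)^{p_i/2}n^{-p_i/2}|H_{1,\hat\theta_i}^{-1}|^{1/2},$$

and hence

$$2\log B_{01} \approx 2\log\left(\frac{f(x|\hat\theta_0)}{f(x|\hat\theta_1)}\right) + 2\log\left(\frac{g_0(\hat\theta_0)}{g_1(\hat\theta_1)}\right) - (p_0 - p_1)\log\frac{n}{2\pi} + \log\left(\frac{|H_{1,\hat\theta_0}^{-1}|}{|H_{1,\hat\theta_1}^{-1}|}\right). \qquad (6.10)$$

An approximation to (6.10) correct to $O(1)$ is

$$2\log(B_{01}) \approx 2\log\left(\frac{f(x|\hat\theta_0)}{f(x|\hat\theta_1)}\right) - (p_0 - p_1)\log n. \qquad (6.11)$$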


This is the approximate Bayes factor based on the Bayesian information
criterion (BIC) due to Schwarz (1978). The term $(p_0 - p_1)\log n$ can be considered
a penalty for using a more complex model.
A related criterion is

$$2\log(B_{01}) \approx 2\log\left(\frac{f(x|\hat\theta_0)}{f(x|\hat\theta_1)}\right) - 2(p_0 - p_1), \qquad (6.12)$$

which is based on the Akaike information criterion (AIC), namely,

$$\mathrm{AIC} = 2\log f(x|\hat\theta) - 2p$$

for a model $f(x|\theta)$. The penalty for using a complex model is not as drastic
as that in BIC.
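To see how the two penalties in (6.11) and (6.12) differ in practice, the following Python sketch evaluates both approximations for a point null $M_0: \theta = 0$ ($p_0 = 0$) against $M_1: \theta$ unrestricted ($p_1 = 1$) with i.i.d. $N(\theta, 1)$ data. The data set is a made-up example with $n = 100$ and sample mean 0.1, and the function names are illustrative.

```python
import math


def two_log_b01_bic(loglik0, loglik1, p0, p1, n):
    """BIC-based approximation (6.11) to 2 log B01."""
    return 2.0 * (loglik0 - loglik1) - (p0 - p1) * math.log(n)


def two_log_b01_aic(loglik0, loglik1, p0, p1):
    """AIC-based analogue (6.12), with penalty 2 (p0 - p1)."""
    return 2.0 * (loglik0 - loglik1) - 2.0 * (p0 - p1)


def normal_loglik(xs, theta):
    """Log-likelihood of i.i.d. N(theta, 1) observations."""
    return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * (x - theta) ** 2 for x in xs)


# Hypothetical data: 100 observations with sample mean 0.1.
xs = [1.1, -0.9] * 50
n = len(xs)
theta_hat = sum(xs) / n                  # maximum likelihood estimate under M1

ll0 = normal_loglik(xs, 0.0)             # maximized log-likelihood under M0 (no free parameter)
ll1 = normal_loglik(xs, theta_hat)       # maximized log-likelihood under M1

bic_value = two_log_b01_bic(ll0, ll1, p0=0, p1=1, n=n)
aic_value = two_log_b01_aic(ll0, ll1, p0=0, p1=1)
print(bic_value)   # -n * xbar^2 + log n
print(aic_value)   # -n * xbar^2 + 2
```

Here the likelihood ratio term equals $-n\bar{x}^2 = -1$, so the BIC-based value is $-1 + \log 100 \approx 3.6$ and the AIC-based value is 1; both favor $M_0$, the BIC version more strongly because its penalty grows with $n$.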

A Bayesian interpretation of AIC for high-dimensional prediction problems 
is presented in Chapter 9. Problem 16 of Chapter 9 invites you to explore if 
AIC is suitable for low-dimensional testing problems. 


6.2 P-value and Posterior Probability of $H_0$ as Measures
of Evidence Against the Null


One particular tool from classical statistics that is very widely used in applied
sciences for model checking or hypothesis testing is the P-value. It also
happens to be one of the concepts that is highly misunderstood and misused. The
basic idea behind R.A. Fisher's (see Fisher (1973)) original (1925) definition
of P-value given below did have a great deal of appeal: It is the probability
under a (simple) null hypothesis of obtaining a value of a test statistic that is
at least as extreme as that observed in the sample data.
Suppose that it is desired to test

$$H_0 : \theta = \theta_0 \text{ versus } H_1 : \theta \ne \theta_0, \qquad (6.13)$$

and that a classical significance test is available and is based on a test statistic
$T(X)$, large values of which are deemed to provide evidence against the null
hypothesis. If data $X = x$ is observed, with corresponding $t = T(x)$, the
P-value then is

$$\alpha = P_{\theta_0}(T(X) \ge T(x)).$$


Example 6.2. Suppose we observe $X_1, \ldots, X_n$ i.i.d. from $N(\theta, \sigma^2)$, where $\sigma^2$
is known. Then $\bar{X}$ is sufficient for $\theta$ and it has the $N(\theta, \sigma^2/n)$
distribution. Noting that $T = T(\bar{X}) = |\sqrt{n}(\bar{X} - \theta_0)/\sigma|$ is a natural test
statistic to test (6.13), one obtains the usual P-value as $\alpha = 2[1 - \Phi(t)]$, where
$t = |\sqrt{n}(\bar{x} - \theta_0)/\sigma|$ and $\Phi$ is the standard normal cumulative distribution
function.


Fisher meant P-value to be used informally as a measure of degree of
surprise in the data relative to $H_0$. This use of P-value as a post-experimental
or conditional measure of statistical evidence seems to have some intuitive
justification. From a Bayesian point of view, various objections have been
raised by Edwards et al. (1963), Berger and Sellke (1987), and Berger and
Delampady (1987), against use of P-values as measures of evidence against
$H_0$. A recent review is Ghosh et al. (2005).

To a Bayesian the posterior probability of $H_0$ summarizes the evidence
against $H_0$. In many of the common cases of testing, the P-value is smaller
than the posterior probability by an order of magnitude. The reason for this
is that the P-value ignores the likelihood of the data under the alternative 
and takes into account not only the observed deviation of the data from the 
null hypothesis as measured by the test statistic but also more significant 
deviations. In view of these facts, one may wish to see if P-values can be 
calibrated in terms of bounds for posterior probabilities over natural classes of 
priors. It appears that calibration takes the form of a search for an alternative 
measure of evidence based on posterior that may be acceptable to a non- 
Bayesian. In this connection, note that there is an interesting discussion of 
the admissibility of P-value as a measure of evidence in Hwang et al. (1992). 


6.3 Bounds on Bayes Factors and Posterior Probabilities 


6.3.1 Introduction 


We begin with an example where P-values and the posterior probabilities are 
very different. 


Example 6.3. We observe $X \sim N(\theta, \sigma^2/n)$, with known $\sigma^2$. Upon using $T =
|\sqrt{n}(X - \theta_0)/\sigma|$ as the test statistic to test (6.13), recall that the P-value
comes out to be $\alpha = 2[1 - \Phi(t)]$, where $t = |\sqrt{n}(x - \theta_0)/\sigma|$ and $\Phi$ is the
standard normal cumulative distribution function. On the set $\{\theta \ne \theta_0\}$, let $\theta$
have the density ($g_1$) of $N(\mu, \tau^2)$. Then, we have,

$$B_{01} = \sqrt{1 + \rho^{-2}}\,\exp\left\{-\frac{1}{2}\left[\frac{(t - \rho\eta)^2}{1 + \rho^2} - \eta^2\right]\right\},$$

where $\rho = \sigma/(\sqrt{n}\,\tau)$ and $\eta = (\theta_0 - \mu)/\tau$. Now, if we choose $\mu = \theta_0$, $\tau = \sigma$ and
$\pi_0 = 1/2$, we get,

$$B_{01} = \sqrt{1 + n}\,\exp\left\{-\frac{1}{2}\,\frac{n}{n+1}\,t^2\right\}.$$

For various values of $t$ and $n$, the different measures of evidence, $\alpha$ =
P-value, $B$ = Bayes factor, and $P = P(H_0|x)$ are displayed in Table 6.1 as
shown in Berger and Delampady (1987). It may be noted that the posterior
probability of $H_0$ varies between 4 and 50 times the corresponding P-value
which is an indication of how different these two measures of evidence can be.
Table 6.1. Normal Example: Measures of Evidence

                 n = 1      n = 5      n = 10     n = 20     n = 50     n = 100
  t      α       B    P     B    P     B    P     B    P     B    P     B    P
1.645   .10     .72  .42   .79  .44   .89  .47  1.27  .56  1.86  .65  2.57  .72
1.960   .05     .54  .35   .49  .33   .59  .37   .72  .42  1.08  .52  1.50  .60
2.576   .01     .27  .21   .15  .13   .16  .14   .19  .16   .28  .22   .37  .27
3.291   .001    .10  .09   .03  .03   .02  .02   .03  .03   .03  .03   .05  .05
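The three measures of evidence in Example 6.3 are simple to compute. The sketch below evaluates the P-value, the Bayes factor for the choice $\mu = \theta_0$, $\tau = \sigma$, and the posterior probability from (6.4) with $\pi_0 = 1/2$. Function names are illustrative, and individual printed entries of Table 6.1 may differ slightly from the computed values.

```python
import math


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_value(t):
    """Two-sided P-value alpha = 2 [1 - Phi(|t|)]."""
    return 2.0 * (1.0 - std_normal_cdf(abs(t)))


def bayes_factor(t, n):
    """B01 for the N(theta0, sigma^2) prior: sqrt(1 + n) exp{-n t^2 / (2 (n + 1))}."""
    return math.sqrt(1.0 + n) * math.exp(-0.5 * n * t * t / (n + 1.0))


def posterior_prob_h0(b01, pi0=0.5):
    """Posterior probability of H0 from (6.4)."""
    return 1.0 / (1.0 + ((1.0 - pi0) / pi0) / b01)


for t in (1.645, 1.960, 2.576, 3.291):
    cells = []
    for n in (1, 5, 10, 20, 50, 100):
        b = bayes_factor(t, n)
        cells.append(f"{b:5.2f} {posterior_prob_h0(b):4.2f}")
    print(f"t = {t:5.3f}  alpha = {p_value(t):6.4f} | " + " | ".join(cells))
```

For $t = 1.960$ and $n = 1$, for instance, this gives $B \approx 0.54$ and $P(H_0|x) \approx 0.35$ against a P-value of 0.05.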









6.3.2 Choice of Classes of Priors 


Clearly, there are irreconcilable differences between the classical P-value and
the corresponding Bayesian measures of evidence in the above example.
However, one may argue that the differences are perhaps due to the choice of $\pi_0$
or $g_1$ that cannot claim to be really 'objective.' The choice of $\pi_0 = 1/2$ may
not be crucial because the Bayes factor, $B$, which does not need this, seems
to be providing the same conclusion, but the choice of $g_1$ does have
substantial effect. To counter this argument, let us consider lower bounds on $B$ and
$P$ over wide classes of prior densities. What is surprising is that even these
lower bounds that are based on priors 'least favorable' to $H_0$ are typically an
order of magnitude larger than the corresponding P-values for precise null
hypotheses. The other motivation for looking at bounds over classes of priors is
that they correspond with robust Bayesian answers that are more compelling
when an objective choice for a single prior does not exist. Thus, in the case
of precise null hypotheses, if $G$ is the class of all plausible conditional prior
densities $g_1$ under $H_1$, we are then led to the consideration of the following
bounds.

$$\underline{B}(G, x) = \inf_{g \in G} B_{01} = \frac{f(x|\theta_0)}{\sup_{g \in G} m_g(x)}, \qquad (6.14)$$

where $m_g(x) = \int_{\theta \ne \theta_0} f(x|\theta)g(\theta)\,d\theta$, and

$$\underline{P}(H_0|G, x) = \inf_{g \in G} P(H_0|x) = \left[1 + \frac{1 - \pi_0}{\pi_0}\,\frac{1}{\underline{B}(G, x)}\right]^{-1}. \qquad (6.15)$$

This brings us back to the question of choice of the class $G$ as in Chapter 3,
where the robust Bayesian approach has been discussed. As explained there,
robustness considerations force us to consider classes that are neither too
large nor too small. Choosing the class $G_A$ = {all densities} certainly is very
extreme because it allows densities that are severely biased towards $H_1$. Quite
often, the class $G_{NC}$ = {all natural conjugate densities with mean $\theta_0$} is an
interesting class to consider. However, this turns out to be inadequate for
robustness considerations. The following class

$$G_{US} = \{\text{all densities symmetric about } \theta_0 \text{ and non-increasing in } |\theta - \theta_0|\} \qquad (6.16)$$

which strikes a balance between these two extremes seems to be a good choice.
Because we are comparing various measures of evidence, it is informative to
examine the lower bounds for each of these classes. In particular, we can
gather the magnitudes of the differences between these measures across the
classes. To simplify proofs of some of the results given below, we restate a
result indicated in Section 3.8.1.


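Lemma 6.4. Suppose $C_T$ is a set of prior probability measures on $R^p$ given
by $C_T = \{\nu_t : t \in T\}$, $T \subset R^d$, and let $C$ be the convex hull of $C_T$. Then

$$\sup_{\pi \in C} \int_\Theta f(x|\theta)\, d\pi(\theta) = \sup_{t \in T} \int_\Theta f(x|\theta)\, d\nu_t(\theta). \tag{6.17}$$

Proof. Because $C \supset C_T$, LHS $\geq$ RHS in (6.17). However, as
$\int_\Theta f(x|\theta)\, d\pi(\theta) = \int_\Theta f(x|\theta) \int_T \nu_t(d\theta)\, \mu(dt)$,
for some probability measure $\mu$ on $T$, using Fubini's theorem,

$$\int_\Theta f(x|\theta)\, d\pi(\theta) = \int_\Theta f(x|\theta) \int_T d\nu_t(\theta)\, d\mu(t)
= \int_T \left( \int_\Theta f(x|\theta)\, d\nu_t(\theta) \right) d\mu(t)
\leq \sup_{t \in T} \int_\Theta f(x|\theta)\, d\nu_t(\theta).$$

Therefore,

$$\sup_{\pi \in C} \int_\Theta f(x|\theta)\, d\pi(\theta) \leq \sup_{t \in T} \int_\Theta f(x|\theta)\, d\nu_t(\theta),$$

yielding the other inequality also. □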
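
The content of Lemma 6.4 can be checked numerically in a toy case. The sketch below is only an illustration under assumptions that are not in the text: the generating family consists of five point masses at hypothetical locations, the likelihood is N(theta, 1), and the observation is x = 1.3. Every random mixture of the generators gives a marginal no larger than the best single generator.

```python
import math
import random


def normal_likelihood(x, theta, sigma=1.0):
    """Density of N(theta, sigma^2) evaluated at x."""
    z = (x - theta) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def marginal_under_mixture(x, thetas, weights):
    """Marginal m(x) when the prior is a mixture of point masses at `thetas`."""
    return sum(w * normal_likelihood(x, th) for th, w in zip(thetas, weights))


def sup_over_generators(x, thetas):
    """Right-hand side of (6.17): best marginal over the generating priors."""
    return max(normal_likelihood(x, th) for th in thetas)


def random_weights(k, rng):
    """A random point of the probability simplex (a random mixing measure)."""
    raw = [rng.random() for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


rng = random.Random(0)
x_obs = 1.3                                  # hypothetical observation
thetas = [-2.0, -0.5, 0.0, 1.0, 2.5]         # hypothetical generating point masses
rhs = sup_over_generators(x_obs, thetas)
mixture_values = [
    marginal_under_mixture(x_obs, thetas, random_weights(len(thetas), rng))
    for _ in range(1000)
]
print("largest mixture marginal:", max(mixture_values))
print("sup over generators     :", rhs)
```

Because the marginal is linear in the mixing measure, the supremum over the convex hull is attained at a generator, which is what the comparison of the two printed numbers reflects.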


The following results are from Berger and Sellke (1987) and Edwards et 
al. (1963). 


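Theorem 6.5. Let $\hat{\theta}(x)$ be the maximum likelihood estimate of $\theta$ for the
observed value of $x$. Then

$$\underline{B}(G_A, x) = \frac{f(x|\theta_0)}{f(x|\hat{\theta}(x))}, \tag{6.18}$$

$$\underline{P}(H_0|G_A, x) = \left[1 + \frac{1-\pi_0}{\pi_0} \cdot \frac{1}{\underline{B}(G_A, x)}\right]^{-1}. \tag{6.19}$$


In view of Lemma 6.4, the proof of this result is quite elementary, once it
is noted that the extreme points of $G_A$ are point masses.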


Theorem 6.6. Let $U_S$ be the class of all uniform distributions symmetric
about $\theta_0$. Then

$$\underline{B}(G_{US}, x) = \underline{B}(U_S, x), \tag{6.20}$$

$$\underline{P}(H_0|G_{US}, x) = \underline{P}(H_0|U_S, x). \tag{6.21}$$

Proof. Simply note that any unimodal symmetric distribution is a mixture of
symmetric uniforms, and apply Lemma 6.4 again. □


Because $\underline{B}(U_S, x) = f(x|\theta_0)/\sup_{g \in U_S} m_g(x)$, computation of

$$\sup_{g \in U_S} m_g(x) = \sup_{K > 0} \frac{1}{2K} \int_{\theta_0 - K}^{\theta_0 + K} f(x|\theta)\, d\theta$$

is required to employ Theorem 6.6. Also, it may be noted that as far as
robustness is concerned, using the class $G_{US}$ of all symmetric unimodal priors
is the same as using the class $U_S$ of all symmetric uniform priors. It is
perhaps reasonable to assume that many of these uniform priors are somewhat
biased against $H_0$, and hence we should consider unimodal symmetric prior
distributions that are smoother. One possibility is scale mixtures of normal
distributions having mean $\theta_0$. This class is substantially larger than just the
class of normals centered at $\theta_0$; it includes Cauchy, all Student's t and so on.
To obtain the lower bounds, however, it is enough to consider

$$G_{Nor} = \{\text{all normal distributions with mean } \theta_0\},$$

in view of Lemma 6.4.


Example 6.7. Let us continue with Example 6.3. We have the following results
from Berger and Sellke (1987) and Edwards et al. (1963).

(i) $\underline{B}(G_A, x) = \exp(-t^2/2)$, because the MLE of $\theta$ is $\bar{x}$; hence

(ii) $\underline{P}(H_0|G_A, x) = \left[1 + \frac{1-\pi_0}{\pi_0} \exp(t^2/2)\right]^{-1}$.

(iii) If $t \leq 1$, $\underline{B}(G_{US}, x) = 1$ and $\underline{P}(H_0|G_{US}, x) = \pi_0$. This is because in
this case, the unimodal symmetric distribution that maximizes $m_g(x)$ is the
degenerate distribution that puts all its mass at $\theta_0$.

(iv) If $t > 1$, the $g \in G_{US}$ that maximizes $m_g(x)$ is non-degenerate and from
Theorem 6.6 and Example 3.4,

$$\underline{B}(G_{US}, x) = \frac{\phi(t)}{\sup_{K > 0} \frac{1}{2K}\{\Phi(K - t) - \Phi(-(K + t))\}}.$$

(v) If $t \leq 1$, $\underline{B}(G_{Nor}, x) = 1$ and $\underline{P}(H_0|G_{Nor}, x) = \pi_0$. If $t > 1$,

$$\underline{B}(G_{Nor}, x) = t \exp\left(-\frac{t^2 - 1}{2}\right).$$

For various values of $t$, the different measures of evidence, $\alpha$ = P-value,
$\underline{B}$ = lower bound on Bayes factor, and $\underline{P}$ = lower bound on $P(H_0|x)$ are displayed
in Table 6.2. $\pi_0$ has been chosen to be 0.5.
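
The three lower bounds of this example can be evaluated directly. The following is a minimal sketch, not the authors' code: it assumes that t is the usual standardized statistic (taken here to be sqrt(n)|xbar - theta_0|/sigma), that the prior scale K is measured in standard-error units, and it replaces the analytic maximization over K by a simple grid search. All function names are illustrative.

```python
import math


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def posterior_lower_bound(bf_lower, pi0=0.5):
    """Equation (6.15): lower bound on P(H0 | x) from a lower bound on B."""
    return 1.0 / (1.0 + (1.0 - pi0) / pi0 / bf_lower)


def bf_all_densities(t):
    """Lower bound over G_A: likelihood ratio at theta_0 versus the MLE."""
    return math.exp(-0.5 * t * t)


def bf_unimodal_symmetric(t, step=1e-3):
    """Lower bound over G_US, maximizing over symmetric uniforms on a grid of K."""
    if t <= 1.0:
        return 1.0
    best = phi(t)                      # limit of the uniform marginal as K -> 0
    n_steps = int((t + 10.0) / step)
    for i in range(1, n_steps + 1):
        k = i * step
        best = max(best, (Phi(k - t) - Phi(-(k + t))) / (2.0 * k))
    return phi(t) / best


def bf_normal(t):
    """Lower bound over G_Nor (normal priors centered at theta_0)."""
    if t <= 1.0:
        return 1.0
    return t * math.exp(-0.5 * (t * t - 1.0))


for t in (1.645, 1.960, 2.576, 3.291):
    cells = []
    for bound in (bf_all_densities, bf_unimodal_symmetric, bf_normal):
        b = bound(t)
        cells.append("%.4f %.4f" % (b, posterior_lower_bound(b)))
    print("t = %.3f | %s" % (t, " | ".join(cells)))
```

The printed rows can be compared with Table 6.2; small differences in the last digit are to be expected from rounding and from the grid search.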

What we note is that the differences between P-values and the corresponding
Bayesian measures of evidence remain irreconcilable even when the lower
bounds on such measures are considered. In other words, even the least possible
Bayes factor and posterior probability of $H_0$ are substantially larger than
the corresponding P-value. This is so even for the choice $G_A$, which is rather
astonishing (see Edwards et al. (1963)).
Table 6.2. Normal Example: Lower Bounds on Measures of Evidence

                     G_A              G_US             G_Nor
   t      alpha     B      P         B      P         B      P
 1.645    .10     .258   .205      .639   .390      .701   .412
 1.960    .05     .146   .128      .408   .290      .473   .321
 2.576    .01     .036   .035      .122   .109      .153   .133
 3.291    .001    .0044  .0044     .018   .018      .024   .0235











6.3.3 Multiparameter Problems 


It is not the case that the discrepancies between P-values and lower bounds on 
Bayes factor or posterior probability of $H_0$ are present only for tests of precise
null hypotheses in single parameter problems. This phenomenon is much more 
prevalent. We shall present below some simple multiparameter problems where 
similar discrepancies have been discovered. The following result on testing a 
p-variate normal mean vector is from Delampady (1986). 


Example 6.8. Suppose $X \sim N_p(\theta, I)$, where $X = (X_1, X_2, \ldots, X_p)$ and
$\theta = (\theta_1, \theta_2, \ldots, \theta_p)$. It is desired to test

$$H_0 : \theta = \theta^0 \text{ versus } H_1 : \theta \neq \theta^0,$$

where $\theta^0 = (\theta_1^0, \theta_2^0, \ldots, \theta_p^0)$ is a specified vector. The classical test statistic is

$$T(X) = \|X - \theta^0\|^2,$$

which has a $\chi^2_p$ distribution under $H_0$. Thus the P-value of the data $x$ is

$$\alpha = P(\chi^2_p \geq T(x)).$$

Consider the class $G_{USP}$ of unimodal spherically symmetric (about $\theta^0$) prior
distributions for $\theta$, the natural generalization of $G_{US}$. This will consist of
densities $g(\theta)$ of the form $g(\theta) = h((\theta - \theta^0)'(\theta - \theta^0))$, where $h$ is non-increasing.
Noting that any unimodal spherically symmetric distribution is a mixture of
uniforms on symmetric spheres, and applying Lemma 6.4, we obtain

$$\sup_{g \in G_{USP}} m_g(x) = \sup_{k > 0} \frac{1}{V(k)} \int_{\|\theta - \theta^0\| \leq k} f(x|\theta)\, d\theta,$$

where $V(k)$ is the volume of a sphere of radius $k$, and $f(x|\theta)$ is the $N_p(\theta, I)$
density. Therefore, we have that

$$\underline{B}(G_{USP}, x) = \frac{\exp(-\frac{1}{2}\|x - \theta^0\|^2)}{\sup_{k > 0} \frac{1}{V(k)} \int_{\|\theta - \theta^0\| \leq k} \exp(-\frac{1}{2}\|x - \theta\|^2)\, d\theta}.$$

Using this result, numerical values were computed for different dimensions
$p$ and different P-values $\alpha$. In Table 6.3 we present these values, where
$\underline{B}$ denotes $\underline{B}(G_{USP}, x)$ and $\underline{P}$ denotes $\underline{P}(H_0|G_{USP}, x)$ for $\pi_0 = 0.5$. As can
be readily seen, the lower bounds remain substantially larger than the
corresponding P-values in all dimensions.

Table 6.3. Multivariate Normal Case: Lower Bounds on Measures of Evidence


Note that spherical symmetry is not the only generalization of symmetry
from one dimension to higher dimensions. Very different answers can be
obtained if, for example, elliptical symmetry is used instead. Suppose we
consider densities of the form $g(\theta) = \sqrt{|Q|}\, h((\theta - \theta^0)'Q(\theta - \theta^0))$, where $Q$ is an
arbitrary positive definite matrix and $h$ is non-increasing. Then the following
result, which is informally stated in Delampady and Berger (1990), obtains.
For the sake of simplicity, let us take $\theta^0 = 0$.


Theorem 6.9. Let $f(x|\theta)$ be a multivariate, multiparameter density. Consider
the class of elliptically symmetric unimodal prior densities

$$G_{UES} = \left\{ g : g(\theta) = |Q|^{1/2} h(\theta'Q\theta),\ h \text{ non-increasing},\ Q \text{ positive definite} \right\}. \tag{6.22}$$

Then

$$\sup_{g \in G_{UES}} m_g(x) = \sup_{Q > 0} \left\{ \sup_{k > 0} \frac{1}{V(k)} \int_{\|u\| \leq k} f(x|Q^{-1/2}u)\, du \right\}, \tag{6.23}$$

where $V(k)$ is the volume of a sphere of radius $k$, and $Q > 0$ denotes that $Q$
is positive definite.


Proof. Note that

$$\sup_{g \in G_{UES}} m_g(x) = \sup_{g \in G_{UES}} \int f(x|\theta) g(\theta)\, d\theta$$

$$= \sup_{h, Q} \int f(x|\theta)\, h(\theta'Q\theta)\, |Q|^{1/2}\, d\theta$$

$$= \sup_{Q > 0} \left\{ \sup_{h} \int f(x|Q^{-1/2}u)\, h(u'u)\, du \right\} \tag{6.24}$$

$$= \sup_{Q > 0} \left\{ \sup_{k > 0} \frac{1}{V(k)} \int_{\|u\| \leq k} f(x|Q^{-1/2}u)\, du \right\},$$

because the maximization of the inside integral over non-increasing $h$ in (6.24)
is the same as maximization of that integral over the class of unimodal
spherically symmetric densities, and hence Lemma 6.4 applies. □


Consider the above result in the context of Example 6.8. The lower bounds
on the Bayes factor as well as the posterior probability of the null hypothesis
will be substantially lower if we use the class $G_{UES}$ rather than $G_{USP}$. This is
immediate from (6.23), because the lower bounds over $G_{USP}$ correspond with
the maximum in (6.23) with $Q = I$. The result also questions the suitability
of $G_{UES}$ for these lower bounds in view of the fact that the lower bounds will
correspond with prior densities that are extremely biased towards $H_1$.

Many other esoteric classes of prior densities have also been considered 
by some for deriving lower bounds. In particular, generalization of symme- 
try from the single-parameter case to the multiparameter case has been ex- 
amined. DasGupta and Delampady (1990) consider several subclasses of the 
symmetric star-unimodal densities. Some of these are mixtures of uniform
distributions on $\ell_p$ balls (for $p = 1, 2, \infty$), the class of distributions with components
that are independent symmetric unimodal distributions and a certain subclass
of one-symmetric distributions. Note that mixtures of uniform distributions
on $\ell_2$ balls are simply unimodal spherically symmetric distributions, whereas
mixtures of uniform distributions on $\ell_1$ balls contain distributions whose
components are i.i.d. exponential distributions. Uniform distributions on
hypercubes form a subclass of mixtures of uniforms on $\ell_\infty$ balls. Also considered
there is the larger subclass consisting of distributions whose components are 
identical symmetric unimodal distributions. Another class of one-symmetric 
distributions considered there is of interest because it contains distributions 
whose components are i.i.d. Cauchy. Even though studies such as these are 
important from robustness considerations, we feel that they do not necessarily 
add to our understanding of possible interpretation of P-values from a robust 
Bayesian point of view. However, interested readers will find that Dharmad- 
hikari and Joag-Dev (1988) is a good source for multivariate unimodality, and 
Fang et al. (1990) is a good reference for multivariate symmetry for material 
related to the classes mentioned above. 

We have noted earlier that computation of Bayes factor and posterior 
probability is difficult when parametric alternatives are not available. Many 
frequentist statisticians claim that P-values are valuable when there are no 
alternatives explicitly specified, as is common with tests of fit. We consider 
this issue here for a particularly common test of fit, the chi-squared test of 
goodness of fit. It will be observed that alternatives do exist implicitly, and 
hence Bayes factors and posterior probabilities can indeed be computed. The 
following results from Delampady and Berger (1990) once again point out the
discrepancies between P-values and Bayesian measures of evidence. 


Example 6.10. Let $n = (n_1, n_2, \ldots, n_k)$ be a sample of fixed size $N = \sum_{i=1}^k n_i$
from a $k$-cell multinomial distribution with unknown cell probabilities
$p = (p_1, p_2, \ldots, p_k)$ and density (mass) function

$$f(n|p) = \frac{N!}{\prod_{i=1}^k n_i!} \prod_{i=1}^k p_i^{n_i}.$$

Consider testing

$$H_0 : p = p^0 \text{ versus } H_1 : p \neq p^0,$$

where $p^0 = (p_1^0, p_2^0, \ldots, p_k^0)$ is a specified interior point of the $k$-dimensional
simplex. Instead of focusing on the exact multinomial setup, the most popular
approach is to use the chi-squared approximation. Here the test statistic of
interest is

$$\chi^2 = \sum_{i=1}^k \frac{(n_i - N p_i^0)^2}{N p_i^0},$$

which has the asymptotic distribution (as $N \to \infty$) of $\chi^2_{k-1}$ under $H_0$. To
compare P-values so obtained with the corresponding robust Bayesian measures
of evidence, the following are two natural classes of prior distributions
to consider.

(i) The conjugate class $G_C$ of Dirichlet priors with density

$$g(p) = \frac{\Gamma(\sum_{i=1}^k \alpha_i)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k p_i^{\alpha_i - 1},$$

where $\alpha_i > 0$ satisfy

$$\frac{\alpha_i}{\sum_{j=1}^k \alpha_j} = E^g(p_i) = p_i^0.$$

(ii) Consider the following transform of $(p_1, p_2, \ldots, p_{k-1})$:


22.30 _, — 7? 
os u(p) = (=a PY P2 a Bo 1 Pt 
pı P2 Pk—1 


! 


+ (ZPE) (Bi, VI V). 


The justification (see Delampady and Berger (1990)) for using such a transform
is that its range is $R^{k-1}$ unlike that of $p$ and its likelihood function is
more symmetric and closer to a multivariate normal. Now let

$$G_{USP} = \{\text{unimodal } g^*(u) \text{ that are spherically symmetric about } 0\},$$

and consider the class of prior densities $g$ obtained by transforming back to
the original parameter:

$$G_{TUS} = \left\{ g(p) = g^*(u(p)) \left| \frac{\partial u(p)}{\partial p} \right| \right\}.$$

Delampady and Berger (1990) show that as $N \to \infty$, the lower bounds on
Bayes factors over $G_C$ and $G_{TUS}$ converge to those corresponding with the
multivariate normal testing problem (chi-squared test) in Example 6.8, thus
proving that irreconcilability of P-values and Bayesian measures of evidence
is present in goodness of fit problems as well.
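
For concreteness, the classical side of this comparison, Pearson's chi-squared statistic and its asymptotic P-value, can be sketched as follows. The counts and null probabilities are hypothetical, and the P-value helper is restricted to k = 3 cells, where the $\chi^2_2$ tail has the closed form exp(-c/2); other values of k would need a general chi-squared tail routine.

```python
import math


def pearson_chi_squared(counts, p0):
    """Sum over cells of (n_i - N p_i^0)^2 / (N p_i^0)."""
    N = sum(counts)
    return sum((n_i - N * p_i) ** 2 / (N * p_i) for n_i, p_i in zip(counts, p0))


def p_value_three_cells(counts, p0):
    """Asymptotic P-value for k = 3 cells: P(chi^2_2 >= c) = exp(-c / 2)."""
    if len(counts) != 3 or len(p0) != 3:
        raise ValueError("this closed form only covers k = 3 cells")
    return math.exp(-0.5 * pearson_chi_squared(counts, p0))


counts = (40, 40, 20)          # hypothetical cell counts, N = 100
p_null = (0.3, 0.5, 0.2)       # hypothetical interior point p^0
print(pearson_chi_squared(counts, p_null), p_value_three_cells(counts, p_null))
```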


Additional discussion of the multinomial testing problem with mixture of 
conjugate priors can be found in Good (1965, 1967, 1975). Edwards et al. 
(1963) discuss the possibility of finding lower bounds on Bayes factors over 
the conjugate class of priors for the binomial problem. Extensive discussion 
of the binomial problem and further references can be found in Berger and 
Delampady (1987). 


6.3.4 Invariant Tests¹


A natural generalization of the symmetry assumption (on the prior distribu- 
tion) is invariance under a group of transformations. Such a generalization 
and many examples can be found in Delampady (1989a). A couple of those 
examples will be discussed below to show the flavor of the results. The gen- 
eral results that utilize sophisticated mathematical arguments will be skipped, 
and instead interested readers are referred to the source indicated above. For a 
good discussion on invariance of statistical decision rules, see Berger (1985a). 
Recall that the random observable $X$ takes values in a space $\mathcal{X}$ and has
density (mass) function $f(x|\theta)$. The unknown parameter is $\theta \in \Theta \subset R^n$, for
some positive integer $n$. It is desired to test $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$.
We assume the following in addition.
(i) There is a group $G$ (of transformations) acting on $\mathcal{X}$ that induces a group
$\bar{G}$ (of transformations acting) on $\Theta$. These two groups are isomorphic (see
Section 5.1.7) and elements of $G$ will be denoted by $g$, those of $\bar{G}$ by $\bar{g}$.
(ii) $f(gx|\bar{g}\theta) = f(x|\theta)k(g)$ for a suitable continuous map $k$ (from $G$ to $(0, \infty)$).
(iii) $\bar{G}\Theta_0 = \Theta_0$, $\bar{G}\Theta_1 = \Theta_1$, $\bar{G}\Theta = \Theta$.


In this context, the following concept of a maximal invariant is needed. 


Definition. When a group $G$ of transformations acts on a space $\mathcal{X}$, a function
$T(x)$ on $\mathcal{X}$ is said to be invariant (with respect to $G$) if $T(g(x)) = T(x)$ for
all $x \in \mathcal{X}$ and $g \in G$. A function $T(x)$ is maximal invariant (with respect to
$G$) if it is invariant and further

$$T(x_1) = T(x_2) \text{ implies } x_1 = g(x_2) \text{ for some } g \in G.$$


1 Section 6.3.4 may be omitted at first reading. 
This means that $G$ divides $\mathcal{X}$ into orbits where invariant functions are
constant. A maximal invariant assigns different values to different orbits.

Now from (i), we have that the actions of $G$ and $\bar{G}$ induce maximal invariants
$t(X)$ on $\mathcal{X}$ and $\eta(\theta)$ on $\Theta$, respectively.


Remark 6.11. The family of densities $f(x|\theta)$ is said to be invariant under $G$ if
(ii) is satisfied. The testing problem $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$ is said to
be invariant under $G$ if in addition (iii) is also satisfied.


Example 6.12. Consider Example 6.8 again and suppose $X \sim N_p(\theta, I)$. It is
desired to test

$$H_0 : \theta = 0 \text{ versus } H_1 : \theta \neq 0.$$

This testing problem is invariant under the group $G_O$ of all orthogonal
transformations; i.e., if $H$ is an orthogonal matrix of order $p$, then
$g_H X = HX \sim N_p(H\theta, I)$, so that $\bar{g}_H \theta = H\theta$. Further,

$$f(x|\theta) = (2\pi)^{-p/2} \exp\left(-\frac{1}{2}(x - \theta)'(x - \theta)\right),$$

and

$$f(g_H x|\bar{g}_H \theta) = (2\pi)^{-p/2} \exp\left(-\frac{1}{2}(Hx - H\theta)'(Hx - H\theta)\right)
= (2\pi)^{-p/2} \exp\left(-\frac{1}{2}(x - \theta)'(x - \theta)\right)
= f(x|\theta),$$

so that (ii) is satisfied. Also, $\bar{g}_H 0 = 0$, and (iii) too is satisfied.
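
The invariance just verified algebraically is easy to confirm numerically for p = 2, where every rotation is an orthogonal transformation. The sketch below uses hypothetical values for x, theta, and the rotation angle; it also shows that the squared norm, the statistic on which the classical test is based, is unchanged along an orbit.

```python
import math


def normal_density(x, theta):
    """N_p(theta, I) density at x, with x and theta given as lists."""
    p = len(x)
    q = sum((xi - ti) ** 2 for xi, ti in zip(x, theta))
    return (2.0 * math.pi) ** (-0.5 * p) * math.exp(-0.5 * q)


def rotate(v, angle):
    """Apply the 2 x 2 rotation (an orthogonal matrix H) to the vector v."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def squared_norm(v):
    """||v||^2, which is invariant under orthogonal transformations."""
    return sum(vi * vi for vi in v)


x = [1.2, -0.7]        # hypothetical observation
theta = [0.3, 0.4]     # hypothetical parameter value
angle = 0.9            # hypothetical rotation angle in radians

lhs = normal_density(rotate(x, angle), rotate(theta, angle))   # f(g_H x | g_H theta)
rhs = normal_density(x, theta)                                 # f(x | theta)
print(lhs, rhs)
print(squared_norm(rotate(x, angle)), squared_norm(x))
```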


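Example 6.13. Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta, \sigma^2)$ with
both $\theta$ and $\sigma$ unknown. The problem is to test the hypothesis $H_0 : \theta = 0$
against $H_1 : \theta \neq 0$. A sufficient statistic for $(\theta, \sigma)$ is $X = (\bar{X}, S)$, where
$\bar{X} = \sum_{i=1}^n X_i/n$ and $S = [\sum_{i=1}^n (X_i - \bar{X})^2/n]^{1/2}$. Then

$$f(x|\theta, \sigma) = K \sigma^{-n} S^{n-2} \exp\left(-n\left[(\bar{X} - \theta)^2 + S^2\right]/(2\sigma^2)\right),$$

where $K$ is a constant. Also,

$$\mathcal{X} = \{(\bar{x}, s) : \bar{x} \in R, s > 0\}, \text{ and } \Theta = \{(\theta, \sigma) : \theta \in R, \sigma > 0\}.$$

The problem is invariant under the group $G = \{g_c = c : c > 0\}$, where the
action of $g_c$ is given by $g_c(x) = c(\bar{x}, s) = (c\bar{x}, cs)$. Note that
$f(g_c x|\bar{g}_c(\theta, \sigma)) = c^{-2} f(x|\theta, \sigma)$.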


A number of technical conditions in addition to the assumptions (i)-(iii)
yield a very useful representation for the density of the maximal invariant
statistic $t(X)$. Note that this density, $q(t(x)|\eta(\theta))$, depends on the parameter
$\theta$ only through the maximal invariant in the parameter space, $\eta(\theta)$.
The technique involved in the derivation of these results uses an averaging
over a relevant group. The general method of this kind of averaging is due 
to Stein (1956), but because there are a number of mathematical problems 
to overcome, various different approaches were discovered as can be seen in 
Wijsman (1967, 1985, 1986), Andersson (1982), Andersson et al. (1983), and 
Farrell (1985). For further details, see Eaton (1989), Kariya and Sinha (1989), 
and Wijsman (1990). The specific conditions and proofs of these results can 
be found in the above references. In particular, the groups considered here are 
amenable groups as presented in detail in Bondar and Milnes (1981). See also 
Delampady (1989a). The orthogonal group, and the group of location-scale 
transformations are amenable. The multiplicative group of non-singular p x p 
linear transformations is not. 

Let us return to the issue of comparison of P-values and lower bounds
on Bayes factors and posterior probabilities (of hypotheses) in this setup.
We note that it is necessary to reduce the problem by using invariance for
any meaningful comparison because the classical test statistic and hence the
computation of P-value are already based on this. Therefore, the natural class
of priors to be used for this comparison is the class $G_I$ of $G$-invariant priors;
i.e., those priors $\pi$ that satisfy
(iv) $\pi(A) = \pi(A\bar{g})$.


Theorem 6.14. If $G$ is a group of transformations satisfying certain regularity
conditions (see Delampady (1989a)),

$$\underline{B}(G_I, x) = \frac{\inf_{\eta_1 \in \Theta_0/\bar{G}} q(t(x)|\eta_1)}{\sup_{\eta_2 \in \Theta_1/\bar{G}} q(t(x)|\eta_2)}, \tag{6.25}$$

where $\Theta/\bar{G}$ denotes the space of maximal invariants on the parameter space.


Corollary 6.15. If $\Theta_0/\bar{G} = \{0\}$, then under the same conditions as in
Theorem 6.14,

$$\underline{B}(G_I, x) = \frac{q(t(x)|0)}{\sup_{\eta \in \Theta_1/\bar{G}} q(t(x)|\eta)}.$$


Example 6.16. (Example 6.12, continued.) Consider the class of all priors that
are invariant under orthogonal transformations, and note that this class is
simply the class of all spherically symmetric distributions. Now, application
of Corollary 6.15 yields

$$\underline{B}(G_I, x) = \frac{q(t(x)|0)}{q(t(x)|\hat{\eta})},$$

where $q(t|\eta)$ is the density of a noncentral $\chi^2$ random variable with $p$ degrees
of freedom and non-centrality parameter $\eta$, and $\hat{\eta}$ is the maximum likelihood
estimate of $\eta$ from data $t(x)$. For selected values of $p$ the lower bounds, $\underline{B}$ and
$\underline{P}$ (for $\pi_0 = 0.5$), are tabulated against their P-values in Table 6.4.
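
A rough numerical version of this bound is sketched below. It assumes the standard Poisson-mixture representation of the noncentral chi-squared density and approximates the maximum likelihood estimate of the non-centrality parameter by a grid search; the chi-squared critical values 7.8147 and 11.3449 (upper 5% and 1% points for 3 degrees of freedom) are standard table values supplied here as inputs. The function names are illustrative.

```python
import math


def central_chi2_logpdf(t, k):
    """Log density of a central chi-squared variable with k degrees of freedom."""
    return ((0.5 * k - 1.0) * math.log(t) - 0.5 * t
            - 0.5 * k * math.log(2.0) - math.lgamma(0.5 * k))


def noncentral_chi2_pdf(t, p, eta):
    """Noncentral chi-squared density as a Poisson(eta/2) mixture of central ones."""
    if eta <= 0.0:
        return math.exp(central_chi2_logpdf(t, p))
    half = 0.5 * eta
    total, j = 0.0, 0
    while True:
        log_weight = -half + j * math.log(half) - math.lgamma(j + 1.0)
        term = math.exp(log_weight + central_chi2_logpdf(t, p + 2 * j))
        total += term
        if j > half and term <= 1e-16 * total:
            break
        j += 1
    return total


def bf_lower_bound_invariant(t, p, step=0.01):
    """q(t | 0) / sup_eta q(t | eta), with the supremum taken over a grid of eta."""
    numerator = noncentral_chi2_pdf(t, p, 0.0)
    best = numerator
    n_steps = int(2.0 * t / step)
    for i in range(1, n_steps + 1):
        best = max(best, noncentral_chi2_pdf(t, p, i * step))
    return numerator / best


for alpha, t_obs in ((0.05, 7.8147), (0.01, 11.3449)):
    b = bf_lower_bound_invariant(t_obs, 3)
    print("p = 3, alpha = %.2f: B = %.4f, P = %.4f" % (alpha, b, b / (1.0 + b)))
```

The output for p = 3 can be compared with the corresponding row of Table 6.4; agreement to two or three decimals is what this grid-based sketch should be expected to deliver.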
Table 6.4. Invariant Test for Normal Means

          alpha = 0.01         alpha = 0.05
  p        B       P            B       P
  3      .0749   .0697        .2913   .2256
  4      .0745   .0693        .2903   .2250





Notice that the lower bounds on the posterior probabilities of the null 
hypothesis are anywhere from 4 to 7 times as large as the corresponding 
P-values, indicating that there is a vast discrepancy between P-values and 
posterior probabilities. This is the same phenomenon as what was seen in 
Table 6.3. What is, however, interesting is that the class of priors considered 
here is larger and contains the one considered there, but the magnitude of the 
discrepancy is about the same. 


Example 6.17. (Example 6.13, continued.) In the normal example with unknown
variance, we have the maximal invariants $t(x) = \bar{x}/s$ and $\eta(\theta, \sigma) = \theta/\sigma$.
If we define

$$G_I = \left\{ \pi : d\pi(\theta, \sigma) = h_1(\eta)\, d\eta\, \frac{d\sigma}{\sigma},\ h_1 \text{ is any density for } \eta \right\},$$

we obtain

$$\underline{B}(G_I, x) = \frac{q(t(x)|0)}{q(t(x)|\hat{\eta})},$$

where $q(t|\eta)$ is the density of a noncentral Student's t random variable with
$n - 1$ degrees of freedom and non-centrality parameter $\eta$, and $\hat{\eta}$ is the maximum
likelihood estimate of $\eta$. The fact that all the necessary conditions (which are
needed to apply the relevant results) are satisfied is shown in Andersson (1982)
and Wijsman (1967). For selected values of $n$ the lower bounds are tabulated
along with the P-values in Table 6.5.

For small values of n, the lower bounds in Table 6.5 are comparable with 
the corresponding P-values, whereas as n gets large the differences between 
these lower bounds and the P-values get larger. See also in this connection 
Section 6.4. 


There is substantial literature on Bayesian testing of a point null. Among 
these are Jeffreys (1957, 1961), Good (1950, 1958, 1965, 1967, 1983, 1985, 
1986), Lindley (1957, 1961, 1965, 1977), Raiffa and Schlaiffer (1961), Ed- 
wards et al. (1963), Hildreth (1963), Smith (1965), Zellner (1971, 1984), Dickey 
(1971, 1973, 1974, 1980), Lempers (1971), Rubin (1971), Leamer (1978), Smith 
and Spiegelhalter (1980), Zellner and Siow (1980), and Diamond and Forrester 
(1983). Related work can also be found in Pratt (1965), DeGroot (1973), 
Dempster (1973), Dickey (1977), Bernardo (1980), Hill (1982), Shafer (1982), 
and Berger (1986). 
Table 6.5. Test for Normal Mean, Variance Unknown





Invariance and Minimaxity 


Our focus has been on deriving bounds on Bayes factors for invariant testing 
problems. There is, however, a large literature on other aspects of invariant 
tests. For example, if the group under consideration satisfies the technical 
condition of amenability and hence the Hunt-Stein theorem is valid, then the 
minimax invariant test is minimax among all tests. We do not discuss these 
results here. For details on this and other related results we would like to refer 
the interested readers to Berger (1985a), Kiefer (1957, 1966), and Lehmann 
(1986). 


6.3.5 Interval Null Hypotheses and One-sided Tests 


Closely related to a sharp null hypothesis $H_0 : \theta = \theta_0$ is an interval null
hypothesis $H_0 : |\theta - \theta_0| \leq \epsilon$. Dickey (1976) and Berger and Delampady (1987)
show that the conflict between P-values and posterior probabilities remains if
$\epsilon$ is sufficiently small. The precise order of magnitude of small $\epsilon$ depends on
the sample size $n$.

One may also ask similar questions of possible conflict between P-values
and posterior probabilities for a one-sided null, say, $H_0 : \theta \leq \theta_0$ versus
$H_1 : \theta > \theta_0$. In the case of $\theta$ = mean of a normal, and the usual uniform prior,
direct calculation shows the P-value equals the posterior probability. On the other
hand, Casella and Berger (1987) show that in general the two are not the same and
the P-value may be smaller or greater depending on the family of densities
in the model. Incidentally, the ambiguity of an improper prior discussed in
Section 6.7 does not apply to one-sided nulls. In this case the Bayes factor
remains invariant if the improper prior is multiplied by an arbitrary constant.


6.4 Role of the Choice of an Asymptotic Framework²


This section is based on Ghosh et al. (2005). Suppose $X_1, \ldots, X_n$ are i.i.d.
$N(\theta, \sigma^2)$, $\sigma^2$ known, and consider the problem of testing $H_0 : \theta = \theta_0$ versus

2 Section 6.4 may be omitted at first reading. 
$H_1 : \theta \neq \theta_0$. If instead of taking a lower bound as in the previous sections, we
take a fixed prior density $g_1(\theta)$ under $H_1$ but let $n$ go to $\infty$, then the conflict
between P-values and posterior probabilities is further enhanced. Historically
this phenomenon was noted earlier than the conflict with the lower bound,
vide Jeffreys (1961) and Lindley (1957).

Let $g_1$ be a uniform prior density over some interval $(\theta_0 - a, \theta_0 + a)$
containing $\theta_0$. The posterior probability of $H_0$ given $X = (X_1, \ldots, X_n)$ is

$$P(H_0|X) = \pi_0 \exp[-n(\bar{X} - \theta_0)^2/(2\sigma^2)]/K,$$

where $\pi_0$ is the specified prior probability of $H_0$ and

$$K = \pi_0 \exp[-n(\bar{X} - \theta_0)^2/(2\sigma^2)] + \frac{1 - \pi_0}{2a} \int_{\theta_0 - a}^{\theta_0 + a} \exp[-n(\bar{X} - \theta)^2/(2\sigma^2)]\, d\theta.$$

Suppose $X$ is such that $\bar{X} = \theta_0 + z_\alpha \sigma/\sqrt{n}$ where $z_\alpha$ is the $100(1 - \alpha)\%$
quantile of $N(0, 1)$. Then $\bar{X}$ is significant at level $\alpha$. Also, for sufficiently large
$n$, $\bar{X}$ is well within $(\theta_0 - a, \theta_0 + a)$ because $\bar{X} - \theta_0$ tends to zero as $n$ increases.
This leads to

$$\int_{\theta_0 - a}^{\theta_0 + a} \exp[-n(\bar{X} - \theta)^2/(2\sigma^2)]\, d\theta \approx \sigma\sqrt{2\pi/n}$$

and hence

$$P(H_0|X) \approx \pi_0 \exp(-z_\alpha^2/2) \Big/ \left[\pi_0 \exp(-z_\alpha^2/2) + \frac{1 - \pi_0}{2a}\, \sigma\sqrt{2\pi/n}\right].$$

Thus $P(H_0|X) \to 1$ as $n \to \infty$ whereas the P-value is equal to $\alpha$ for all $n$.
This is known as the Jeffreys-Lindley paradox. It may be noted that the same
phenomenon would arise with any flat enough prior in place of uniform.
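
The paradox is easy to reproduce numerically. The sketch below evaluates the posterior probability of the null exactly (the integral in K is a difference of normal c.d.f.s) while holding the standardized statistic fixed at z = 1.96. The default values of pi_0, a, sigma, and theta_0 are illustrative assumptions, not values taken from the text.

```python
import math


def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def posterior_prob_null(n, z=1.96, pi0=0.5, a=1.0, sigma=1.0, theta0=0.0):
    """P(H0 | X) under a uniform(theta0 - a, theta0 + a) prior on the alternative,
    with the sample mean placed at theta0 + z * sigma / sqrt(n)."""
    scale = sigma / math.sqrt(n)
    xbar = theta0 + z * scale
    null_term = pi0 * math.exp(-0.5 * z * z)
    # Exact value of the integral of exp[-n (xbar - theta)^2 / (2 sigma^2)] over the interval.
    integral = scale * math.sqrt(2.0 * math.pi) * (
        Phi((theta0 + a - xbar) / scale) - Phi((theta0 - a - xbar) / scale)
    )
    alt_term = (1.0 - pi0) / (2.0 * a) * integral
    return null_term / (null_term + alt_term)


sample_sizes = [10, 100, 10_000, 1_000_000]
posteriors = [posterior_prob_null(n) for n in sample_sizes]
for n, prob in zip(sample_sizes, posteriors):
    print("n = %8d  P(H0 | X) = %.4f" % (n, prob))
```

With the statistic held at the same significance level throughout, the posterior probability of the null climbs towards 1 as n grows.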

Indeed, P-values cannot be compared across sample sizes or across
experiments; see Lindley (1957) and Ghosh et al. (2005). Even a frequentist tends to
agree that the conventional values of the significance level $\alpha$ like $\alpha = 0.05$ or
0.01 are too large for large sample sizes. This point is further discussed below.

The Jeffreys-Lindley paradox shows that for inference about $\theta$, P-values
and Bayes factors may provide contradictory evidence and hence can lead
to opposite decisions. Once again, as mentioned in Section 6.3, the evidence
against $H_0$ contained in P-values seems unrealistically high. We argue in this
section that part of this conflict arises from the fact that different types of
asymptotics are being used for the Bayes factors and the P-values. We begin
with a quick review of the two relevant asymptotic frameworks in classical
statistics for testing a sharp null hypothesis.

The standard asymptotics of classical statistics is based on what are called
Pitman alternatives, namely, $\theta_n = \theta_0 + d/\sqrt{n}$ at a distance of $O(1/\sqrt{n})$ from
the null. The Pitman alternatives are also called contiguous in the very general
asymptotic theory developed by Le Cam (vide Roussas (1972), Le Cam and
Yang (2000), Hájek and Šidák (1967)). The log-likelihood ratio of a contiguous
alternative with respect to the null is stochastically bounded as $n \to \infty$. On
the other hand, for a fixed alternative, the log-likelihood ratio tends to $-\infty$
(under the null) or $\infty$ (under the fixed alternative). If the probability of Type
1 error is $0 < \alpha < 1$, then the behavior of the likelihood ratio has the following
implication. The probability of Type 2 error will converge to $0 < \beta < 1$ under
a contiguous alternative $\theta_n$ and to zero if $\theta$ is a fixed alternative. This means
the fixed alternatives are relatively easy to detect. So in this framework it is
assumed that the alternatives of importance are the contiguous alternatives.
Let us call this theory Pitman type asymptotics.

There are several other frameworks in classical statistics of which
Bahadur's (Bahadur, 1971; Serfling, 1980, pp. 332-341) has been studied most.
We focus on Bahadur's approach. In Bahadur's theory, the alternatives of
importance are fixed and do not depend on $n$. Given a test statistic, Bahadur
evaluates its performance at a fixed alternative by the limit (in probability or
a.s.) of $\frac{1}{n}$(log P-value) when the alternative is true.

Which of these two asymptotics is appropriate in a given situation should
depend on which alternatives are important, fixed alternatives or Pitman
alternatives $\theta_0 + d/\sqrt{n}$ that approach the null hypothesis at a certain rate.
This in turn depends on how the sample size $n$ is chosen. If $n$ is chosen to
ensure a Type 2 error bounded away from 0 and 1 (like $\alpha$), then Pitman
alternatives seem appropriate. If $n$ is chosen to be quite large, depending on
available resources but not on alternatives, then Bahadur's approach would
be reasonable.


6.4.1 Comparison of Decisions via P-values and Bayes Factors in 
Bahadur’s Asymptotics 


In this subsection, we essentially follow Bahadur's approach for both P-values
and Bayes factors. A Pitman type asymptotics is used for both in the next
subsection. We first show that if the P-value is sufficiently small, as small as
it is typically in Bahadur's theory, $B_{01}$ will tend to zero, calling for rejection
of $H_0$, i.e., the evidence in the P-value points in the same direction as that
in the Bayes factor or posterior probability, removing the sense of paradox
in the result of Jeffreys and Lindley. One could, therefore, argue that the
P-values or the significance level $\alpha$ assumed in the Jeffreys-Lindley example are
not small enough. The asymptotic framework chosen is not appropriate when
contiguous alternatives are not singled out as alternatives of importance.

We now verify the claim about the limit of $B_{01}$. Without loss of generality,
take $\theta_0 = 0$, $\sigma^2 = 1$. First note that

$$\log B_{01} = -\frac{n}{2}\bar{X}^2 + \frac{1}{2}\log n + R_n, \tag{6.26}$$

where

$$R_n = -\log \pi(\bar{X}|H_1) - \frac{1}{2}\log(2\pi) + o(1)$$

provided the prior $g_1(\theta)$ is a continuous function of $\theta$ and is positive at all $\theta$.
If we omit $R_n$ from the right-hand side of (6.26), we have Schwarz's (1978)
approximation to the Bayes factor via BIC (Section 4.3).

The logarithm of the P-value ($p$) corresponding to observed $\bar{X}$ is

$$\log p = \log 2[1 - \Phi(\sqrt{n}|\bar{X}|)] = -\frac{n}{2}\bar{X}^2(1 + o(1))$$

by standard approximation to a normal tail (vide Feller (1973, p. 175) or
Bahadur (1971, p. 1)). Thus $\frac{1}{n}\log p \to -\theta^2/2$ and by (6.26), $\log B_{01} \to -\infty$.
This result is true as long as $|\bar{X}| > c(\log n/n)^{1/2}$, $c > \sqrt{2}$. Such deviations are
called moderate deviations, vide Rubin and Sethuraman (1965). Of course,
even for such P-values, $p \sim (B_{01}/n)$ so that P-values are smaller by an order
of magnitude. The conflict in measuring evidence remains but the decisions
are the same.
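
These limits can be illustrated with a concrete prior. The sketch below assumes, purely for illustration, a N(0, 1) prior for theta under the alternative (so that the exact Bayes factor is available in closed form) and a fixed observed mean of 0.5; neither choice comes from the text. It compares the exact log Bayes factor with the approximation obtained by dropping R_n from (6.26), and evaluates (1/n) log p.

```python
import math


def log_bayes_factor_unit_normal_prior(n, xbar):
    """Exact log B01 for N(theta, 1) data when theta ~ N(0, 1) under H1:
    the sample mean is N(0, 1/n) under H0 and N(0, 1 + 1/n) under H1."""
    return 0.5 * math.log(n + 1.0) - 0.5 * n * (n / (n + 1.0)) * xbar * xbar


def bic_approximation(n, xbar):
    """Right-hand side of (6.26) with the remainder R_n omitted."""
    return -0.5 * n * xbar * xbar + 0.5 * math.log(n)


def log_p_value(n, xbar):
    """log of 2 * (1 - Phi(sqrt(n) |xbar|)), written with the complementary error function."""
    return math.log(math.erfc(math.sqrt(n) * abs(xbar) / math.sqrt(2.0)))


xbar = 0.5                        # hypothetical fixed value of the sample mean
for n in (25, 100, 400):
    exact = log_bayes_factor_unit_normal_prior(n, xbar)
    print("n = %3d  log B01 = %8.3f  BIC approx = %8.3f  (1/n) log p = %.4f"
          % (n, exact, bic_approximation(n, xbar), log_p_value(n, xbar) / n))
```

In this setup the gap between the exact value and the approximation settles near 0.125, which is the value of R_n implied by the assumed N(0, 1) prior at xbar = 0.5, and (1/n) log p approaches -xbar²/2 while log B01 decreases without bound.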

Ghosh et al. (2005) also pursue the comparison of the three measures of
evidence based on the likelihood ratio, the P-value based on the likelihood
ratio test, and the Bayes factor $B_{01}$ under general regularity conditions.


6.4.2 Pitman Alternative and Rescaled Priors 


We consider once again the problem of testing $H_0 : \theta = 0$ versus $H_1 : \theta \neq 0$
on the basis of a random sample from $N(\theta, 1)$. Suppose that the Pitman
alternatives are the most important ones and the prior $g_1(\theta)$ under $H_1$ puts
most of the mass on Pitman alternatives. One such prior is $N(0, \delta/n)$. Then

$$B_{01} = \sqrt{\delta + 1}\, \exp\left[-\frac{n}{2}\left(\frac{\delta}{\delta + 1}\right)\bar{X}^2\right].$$

If the P-value is close to zero, $\sqrt{n}|\bar{X}|$ is large and therefore, $B_{01}$ is also close
to zero, i.e., for these priors there is no paradox. The two measures are of the
same order but the result of Berger and Sellke (1987) for symmetric unimodal
priors still implies that the P-value is smaller than the Bayes factor.
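
A short sketch makes the absence of a paradox visible: with the sample mean held at z standard errors from zero, the Bayes factor under the rescaled prior does not change with n. The values z = 1.96 and delta = 1 are illustrative choices, not values from the text.

```python
import math


def bayes_factor_rescaled_prior(n, xbar, delta):
    """B01 for N(theta, 1) data under the N(0, delta / n) prior on the alternative."""
    return math.sqrt(delta + 1.0) * math.exp(
        -0.5 * n * (delta / (delta + 1.0)) * xbar * xbar
    )


def two_sided_p_value(n, xbar):
    """2 * (1 - Phi(sqrt(n) |xbar|)), written with the complementary error function."""
    return math.erfc(math.sqrt(n) * abs(xbar) / math.sqrt(2.0))


z, delta = 1.96, 1.0              # hypothetical standardized statistic and prior scale
for n in (10, 1000, 100000):
    xbar = z / math.sqrt(n)
    print("n = %6d  B01 = %.4f  P-value = %.4f"
          % (n, bayes_factor_rescaled_prior(n, xbar, delta), two_sided_p_value(n, xbar)))
```

Both quantities stay fixed as n varies, and the Bayes factor remains well above the P-value, in line with the remark about the result of Berger and Sellke (1987).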


6.5 Bayesian P-value³


Even though valid Bayesian quantities such as Bayes factor and posterior 
probability of hypotheses are in principle the correct tools to measure the 
evidence for or against hypotheses, they are quite often, and especially in 
many practical situations, very difficult to compute. This is because either the
alternatives are only very vaguely specified, vide (6.6), or very complicated. 
Also, in some cases one may not wish to compare two or more models but 
check how a model fits the data. Bayesian P-values have been proposed to 
deal with such problems. 


3 Section 6.5 may be omitted at first reading. 
Let $M_0$ be a target model, and departure from this model be of interest.
If, under this model, $X$ has density $f(x|\eta)$, $\eta \in \mathcal{E}$, then for a Bayesian with
prior $\pi$ on $\eta$, $m_\pi(x) = \int_{\mathcal{E}} f(x|\eta)\pi(\eta)\, d\eta$, the prior predictive distribution, is the
actual predictive distribution of $X$. Therefore, if a model departure statistic
$T(X)$ is available, then one can define the prior predictive P-value (or tail area
under the predictive distribution) as

$$p = P^{m_\pi}(T(X) \geq T(x_{obs})|M_0),$$

where $x_{obs}$ is the observed value of $X$ (see Box (1980)). Although it is true
that this is a valid Bayesian quantity for model checking and it is useful in
situations such as the ones described in Exercise 13 or Exercise 14, it does
face the criticism that it may be influenced to a large extent by the prior $\pi$
as can be seen in the following example.


Example 6.18. Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta, \sigma^2)$ with
both $\theta$ and $\sigma^2$ unknown. It is of interest to detect discrepancy in the mean of
the model with the target model being $M_0 : \theta = 0$. Note that $T = \sqrt{n}\bar{X}$
(actually its absolute value) is the natural model departure statistic for checking
this.

(a) Case 1. It is felt a priori that $\sigma^2$ is known, or equivalently, we choose
the prior on $\sigma^2$ which puts all its mass at some known constant $\sigma_0^2$. Then
under $M_0$, there are no unknown parameters and hence the prior predictive
P-value is simply $2(1 - \Phi(\sqrt{n}|\bar{x}_{obs}|/\sigma_0))$, where $\bar{x}_{obs}$ is the observed value of
$\bar{X}$. This can highly overestimate the evidence against $M_0$ if $\sigma_0^2$ happens to
underestimate the actual model variance.

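(b) Case 2. Consider the usual non-informative prior on $\sigma^2$:
$\pi(\sigma^2) \propto 1/\sigma^2$. Then,

$$m_\pi(x) = \int_0^\infty f_X(x|\sigma^2)\pi(\sigma^2)\,d\sigma^2
\propto \int_0^\infty (\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right) \frac{d\sigma^2}{\sigma^2}
\propto \left(\sum_{i=1}^n x_i^2\right)^{-n/2},$$

which is an improper density, thus completely disallowing computation of the
prior predictive P-value.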

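(c) Case 3. Consider an inverse Gamma prior $IG(\nu, \beta)$ with the following
density for $\sigma^2$:
$\pi(\sigma^2|\nu, \beta) = \frac{\beta^\nu}{\Gamma(\nu)} (\sigma^2)^{-(\nu+1)} \exp(-\beta/\sigma^2)$
for $\sigma^2 > 0$, where $\nu$ and $\beta$ are specified positive constants.
Because $T|\sigma^2 \sim N(0, \sigma^2)$, under this prior the predictive density
of $T$ is then,

$$m_\pi(t) = \int_0^\infty f_T(t|\sigma^2)\pi(\sigma^2|\nu, \beta)\,d\sigma^2
\propto \int_0^\infty \exp\left(-\frac{1}{\sigma^2}\left(\beta + \frac{t^2}{2}\right)\right) (\sigma^2)^{-(\nu + 1/2 + 1)}\,d\sigma^2
\propto (2\beta + t^2)^{-(2\nu+1)/2}.$$

Table 6.6. Normal Example: Prior Predictive P-values

ν      .5    .5    .5     1     1     1     2     2     2      5     5     5
β      .5     1     2    .5     1     2    .5     1     2     .5     1     2
p    .300  .398  .506  .109  .189  .300  .017  .050  .122  .0001  .001  .011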


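If $2\nu$ is an integer, under this predictive distribution,

$$\frac{T}{\sqrt{\beta/\nu}} \sim t_{2\nu}.$$

Thus we obtain,

$$p = P^{m_\pi}\left(|\bar{X}| \ge |\bar{x}_{obs}| \mid M_0\right)
= P^{m_\pi}\left(\frac{|T|}{\sqrt{\beta/\nu}} \ge \frac{\sqrt{n}\,|\bar{x}_{obs}|}{\sqrt{\beta/\nu}} \,\Big|\, M_0\right)
= 2\left(1 - F_{2\nu}\left(\frac{\sqrt{n}\,|\bar{x}_{obs}|}{\sqrt{\beta/\nu}}\right)\right),$$

where $F_{2\nu}$ is the c.d.f. of $t_{2\nu}$. For $\sqrt{n}\,\bar{x}_{obs} = 1.96$
and various values of $\nu$ and $\beta$, the corresponding values of the prior
predictive P-values are displayed in Table 6.6.

Further, note that $p \to 1$ as $\beta \to \infty$ for any fixed $\nu > 0$. Thus
it is clear that the prior predictive P-value, in this example, depends
crucially on the values of $\nu$ and $\beta$.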
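The dependence on $\nu$ and $\beta$ can be checked numerically. The following is a minimal sketch, not part of the original text: it evaluates $2(1 - F_{2\nu}(\sqrt{n}|\bar{x}_{obs}|/\sqrt{\beta/\nu}))$ with $\sqrt{n}\,\bar{x}_{obs} = 1.96$ for the $(\nu, \beta)$ pairs of Table 6.6. The function names are hypothetical, and the $t$ tail area is obtained by Simpson's rule on the $t$ density (an implementation choice made here to stay within the Python standard library).

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_two_sided_tail(x, df, steps=2000):
    """P(|t_df| >= |x|) = 1 - 2 * integral of the density over [0, |x|],
    computed with composite Simpson's rule (steps must be even)."""
    x = abs(x)
    if x == 0:
        return 1.0
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * total * h / 3.0

def prior_predictive_p(t_obs, nu, beta):
    """Prior predictive P-value under the IG(nu, beta) prior of Case 3,
    where t_obs = sqrt(n) * xbar_obs."""
    return t_two_sided_tail(t_obs / math.sqrt(beta / nu), 2 * nu)

# One row per nu, one column per beta, as in Table 6.6.
for nu in (0.5, 1, 2, 5):
    print(nu, [round(prior_predictive_p(1.96, nu, b), 4) for b in (0.5, 1, 2)])
```

Increasing `beta` for fixed `nu` in this sketch pushes the P-value toward 1, in line with the limit noted above.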


What can be readily seen in this example is that if the prior $\pi$ used is a
poor choice, even an excellent model can come under suspicion upon employing
the prior predictive P-value. Further, as indicated above, non-informative
priors that are improper (thus making $m_\pi$ improper too) will not allow
computing such a tail area, a further undesirable feature. To rectify these
problems, Gutman (1967), Rubin (1984), Meng (1994), and Gelman et al. (1996)
suggest modifications by replacing $\pi$ in $m_\pi$ by $\pi(\eta|x_{obs})$:

$$m^*(x|x_{obs}) = \int f(x|\eta)\pi(\eta|x_{obs})\,d\eta \quad \text{and} \quad
p = P^{m^*(\cdot|x_{obs})}\left(T(X) \ge T(x_{obs})\right).$$


This is called the posterior predictive P-value. This removes some of the
difficulties cited above. However, this version of the Bayesian P-value has
also come under severe criticism. Bayarri and Berger (1998a) note that these
modified quantities are not really Bayesian. To see this, they observe that
there is “double use” of data in the above modifications: first to convert (a
possibly improper) $\pi(\eta)$ into a proper $\pi(\eta|x_{obs})$, and then again
in computing the tail area of $T(X)$ corresponding with $T(x_{obs})$.
Furthermore, for large sample sizes, the posterior distribution of $\eta$ will
essentially concentrate at $\hat{\eta}$, the MLE of $\eta$, so that
$m^*(x|x_{obs})$ will essentially equal $f(x|\hat{\eta})$, a non-Bayesian
object. In other words, the criticism is that, for large sample sizes the
posterior predictive P-value will not achieve anything more than rediscovering
the classical approach. Let us consider Example 6.18 again.


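Example 6.19. (Example 6.18, continued.) Let us consider the non-informative
prior $\pi(\sigma^2) \propto 1/\sigma^2$ again. Then, as before, because
$T|\sigma^2 \sim N(0, \sigma^2)$, and

$$\pi(\sigma^2|x_{obs}) \propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right)(\sigma^2)^{-\frac{n}{2}-1}
\propto \exp\left(-\frac{n}{2\sigma^2}\left(\bar{x}_{obs}^2 + s_{obs}^2\right)\right)(\sigma^2)^{-\frac{n}{2}-1},$$

the posterior predictive distribution of $T$ is

$$m^*(t|x_{obs}) = \int_0^\infty f_T(t|\sigma^2)\pi(\sigma^2|x_{obs})\,d\sigma^2$$
$$\propto \int_0^\infty (\sigma^2)^{-1/2}\exp\left(-\frac{t^2}{2\sigma^2}\right)(\sigma^2)^{-n/2}\exp\left(-\frac{n}{2\sigma^2}\left(\bar{x}_{obs}^2 + s_{obs}^2\right)\right)\frac{d\sigma^2}{\sigma^2}$$
$$\propto \int_0^\infty \exp\left(-v\left\{n\left(\bar{x}_{obs}^2 + s_{obs}^2\right) + t^2\right\}\right) v^{(n+1)/2}\,\frac{dv}{v}$$
$$\propto \left(1 + \frac{1}{n}\,\frac{t^2}{\bar{x}_{obs}^2 + s_{obs}^2}\right)^{-(n+1)/2},$$

where the substitution $v = 1/(2\sigma^2)$ has been used.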
Therefore, we see that, under the posterior predictive distribution,

$$\frac{T}{\sqrt{\bar{x}_{obs}^2 + s_{obs}^2}} \sim t_n.$$

Thus we obtain the posterior predictive P-value to be

$$p = P^{m^*(\cdot|x_{obs})}\left(|\bar{X}| \ge |\bar{x}_{obs}| \mid M_0\right)
= P^{m^*(\cdot|x_{obs})}\left(\frac{|T|}{\sqrt{\bar{x}_{obs}^2 + s_{obs}^2}} \ge \frac{\sqrt{n}\,|\bar{x}_{obs}|}{\sqrt{\bar{x}_{obs}^2 + s_{obs}^2}} \,\Big|\, M_0\right)
= 2\left(1 - F_n\left(\frac{\sqrt{n}\,|\bar{x}_{obs}|}{\sqrt{\bar{x}_{obs}^2 + s_{obs}^2}}\right)\right),$$

where $F_n$ is the distribution function of $t_n$. This definition of a
Bayesian P-value doesn’t seem satisfactory. Let $|\bar{x}_{obs}| \to \infty$.
Note that then $p \to 2(1 - F_n(\sqrt{n}))$. Actually, $p$ decreases to this
lower bound as $|\bar{x}_{obs}| \to \infty$.

Table 6.7. Values of $p_n = 2(1 - F_n(\sqrt{n}))$

n       1     2     3     4     5     6     7     8     9    10
p_n  .500  .293  .182  .116  .076  .050  .033  .022  .015  .010





Values of this lower bound for different $n$ are shown in Table 6.7. Note
that these values have no serious relationship with the observations and hence
cannot really be used for model checking. Bayarri and Berger (1998a) attribute
this behavior to the ‘double’ use of the data, namely, the use of $\bar{x}$ in
computing both the posterior distribution and the tail area probability of the
posterior predictive distribution.
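The lower bound can be reproduced with a short computation. The sketch below is illustrative only: the function names are hypothetical, $s^2$ is taken to be $\frac{1}{n}\sum (x_i - \bar{x})^2$, and the $t$ tail area is computed by Simpson's rule rather than a library routine. It evaluates $p_n = 2(1 - F_n(\sqrt{n}))$ for $n = 1, \ldots, 10$ and shows the posterior predictive P-value approaching $p_n$ from above as $|\bar{x}_{obs}|$ grows with $s_{obs}^2$ held fixed.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_two_sided_tail(x, df, steps=2000):
    """P(|t_df| >= |x|) via composite Simpson's rule on [0, |x|]."""
    x = abs(x)
    if x == 0:
        return 1.0
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * total * h / 3.0

def posterior_predictive_p(xbar, s2, n):
    """2 * (1 - F_n(sqrt(n) |xbar| / sqrt(xbar^2 + s2)))."""
    return t_two_sided_tail(math.sqrt(n) * abs(xbar) / math.sqrt(xbar ** 2 + s2), n)

def lower_bound(n):
    """p_n = 2 * (1 - F_n(sqrt(n))), the limit as |xbar| grows."""
    return t_two_sided_tail(math.sqrt(n), n)

print([round(lower_bound(n), 3) for n in range(1, 11)])

# With n = 5 and s2 = 1 (hypothetical values), the P-value decreases
# toward lower_bound(5) however extreme the observed mean becomes.
for xbar in (0.5, 1.0, 10.0, 100.0):
    print(xbar, posterior_predictive_p(xbar, 1.0, 5))
```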


In an effort to combine the desirable features of the prior predictive P-value
and the posterior predictive P-value and eliminate the undesirable features,
Bayarri and Berger (see Bayarri and Berger (1998a)) introduce the conditional
predictive P-value. This quantity is based on the prior predictive distribution
$m_\pi$ but is more heavily influenced by the model than the prior. Further,
non-informative priors can be used, and there is no double use of the data. The
steps are as follows: An appropriate statistic $U(X)$, not involving the model
departure statistic $T(X)$, is identified, the conditional predictive
distribution $m(t|u)$ is derived, and the conditional predictive P-value is
defined as

$$p_c = P^{m(\cdot|u_{obs})}\left(T(X) \ge T(x_{obs})\right),$$

where $u_{obs} = U(x_{obs})$. The following example is from Bayarri and Berger
(1998a).


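Example 6.20. (Example 6.18, continued.) $T = \sqrt{n}\,\bar{X}$ is the model
departure statistic for checking discrepancy of the mean in the normal model.
Let $U(X) = s^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$. Note that
$nU|\sigma^2 \sim \sigma^2\chi^2_{n-1}$. Consider
$\pi(\sigma^2) \propto 1/\sigma^2$ again. Then
$\pi(\sigma^2|U = s^2) \propto (\sigma^2)^{-((n-1)/2+1)} \exp(-ns^2/(2\sigma^2))$
is the density of inverse Gamma, and hence the conditional predictive density
of $T$ given $s^2_{obs}$ is

$$m(t|s^2_{obs}) = \int_0^\infty f_T(t|\sigma^2)\pi(\sigma^2|s^2_{obs})\,d\sigma^2$$
$$\propto \int_0^\infty (\sigma^2)^{-1/2}\exp\left(-\frac{t^2}{2\sigma^2}\right)(\sigma^2)^{-(n-1)/2}\exp\left(-\frac{n s^2_{obs}}{2\sigma^2}\right)\frac{d\sigma^2}{\sigma^2}$$
$$\propto \int_0^\infty \exp\left(-v\left\{n s^2_{obs} + t^2\right\}\right) v^{n/2}\,\frac{dv}{v}$$
$$\propto \left(1 + \frac{1}{n}\,\frac{t^2}{s^2_{obs}}\right)^{-n/2},$$

where, as before, $v = 1/(2\sigma^2)$.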
Thus, under the conditional predictive distribution,

$$\sqrt{\frac{n-1}{n}}\,\frac{T}{s_{obs}} \sim t_{n-1},$$

and hence we obtain the conditional predictive P-value to be

$$p_c = P^{m(\cdot|s^2_{obs})}\left(|\bar{X}| \ge |\bar{x}_{obs}| \mid M_0\right)
= P^{m(\cdot|s^2_{obs})}\left(\frac{\sqrt{n-1}\,|\bar{X}|}{s_{obs}} \ge \frac{\sqrt{n-1}\,|\bar{x}_{obs}|}{s_{obs}} \,\Big|\, M_0\right)
= 2\left(1 - F_{n-1}\left(\frac{\sqrt{n-1}\,|\bar{x}_{obs}|}{s_{obs}}\right)\right).$$

We have thus found a Bayesian interpretation for the classical P-value from
the usual t-test. It is worth noting that $s^2_{obs}$ was used to produce the
posterior distribution to eliminate $\sigma^2$, and that $\bar{x}_{obs}$ was
then used to compute the tail area probability. It is also to be noted that in
this example, it was easy to find $U(X)$, which eliminates $\sigma^2$ upon
conditioning, and that the conditional predictive distribution is a standard
one. In general, however, even though this procedure seems satisfactory from a
Bayesian point of view, there are problems related to identifying a suitable
$U(X)$ and also computing tail areas from the (quite often intractable)
$m(t|u_{obs})$.


Another possibility is the partial posterior predictive P-value (see Bayarri
and Berger (1998a) again) defined as follows:

$$p^* = P^{m^*(\cdot)}\left(T(X) \ge T(x_{obs})\right),$$

where the predictive density $m^*$ is obtained using a partial posterior density
$\pi^*$ that does not use the information contained in $t_{obs} = T(x_{obs})$
and is given by

$$m^*(t) = \int f_T(t|\eta)\pi^*(\eta)\,d\eta$$

with the partial posterior $\pi^*$ defined as

$$\pi^*(\eta) \propto f_{X|T}(x_{obs}|t_{obs}, \eta)\,\pi(\eta)
\propto \frac{f_X(x_{obs}|\eta)\,\pi(\eta)}{f_T(t_{obs}|\eta)}.$$


Consider Example 6.18 again with $\pi(\sigma^2) \propto 1/\sigma^2$. Note that
because $X_i$ are i.i.d. $N(0, \sigma^2)$ under $M_0$,

$$f_X(x_{obs}|\sigma^2) \propto (\sigma^2)^{-n/2}\exp\left(-\frac{n}{2\sigma^2}\left(\bar{x}_{obs}^2 + s_{obs}^2\right)\right)
\propto f_{\bar{X}}(\bar{x}_{obs}|\sigma^2)\,(\sigma^2)^{-(n-1)/2}\exp\left(-\frac{n}{2\sigma^2}s_{obs}^2\right),$$

so that

$$f_{X|\bar{X}}(x_{obs}|\bar{x}_{obs}, \sigma^2) \propto (\sigma^2)^{-(n-1)/2}\exp\left(-\frac{n}{2\sigma^2}s_{obs}^2\right).$$

Therefore,

$$\pi^*(\sigma^2) \propto (\sigma^2)^{-((n-1)/2+1)}\exp\left(-\frac{n}{2\sigma^2}s_{obs}^2\right),$$

and is the same as $\pi(\sigma^2|s^2_{obs})$, which was used to obtain the
conditional predictive density earlier. Thus, in this example, the partial
posterior predictive P-value is the same as the conditional predictive P-value.
Because this alternative version $p^*$ does not require the choice of the
statistic $U$, it appears this method may be used for any suitable
goodness-of-fit test statistic $T$. However, we have not seen such work.


6.6 Robust Bayesian Outlier Detection 


Because a Bayes factor is a weighted likelihood ratio, it can also be used for
checking whether an observation should be considered an outlier with respect
to a certain target model relative to an alternative model. One such approach
is as follows. Recall the model selection set-up as given in (6.1). $X$ having
density $f(x|\theta)$ is observed, and it is of interest to compare two models
$M_0$ and $M_1$ given by

$$M_0 : X \text{ has density } f(x|\theta) \text{ where } \theta \in \Theta_0;$$
$$M_1 : X \text{ has density } f(x|\theta) \text{ where } \theta \in \Theta_1.$$

For $i = 0, 1$, $g_i(\theta)$ is the prior density of $\theta$, conditional on
$M_i$ being the true model. To compare $M_0$ and $M_1$ on the basis of a random
sample $x = (x_1, \ldots, x_n)$ the Bayes factor is given by

$$B_{01}(x) = \frac{m_0(x)}{m_1(x)},$$

where $m_i(x) = \int_{\Theta_i} f(x|\theta)g_i(\theta)\,d\theta$ for $i = 0, 1$.
To measure the effect on the Bayes factor of observation $x_d$ one could use
the quantity

$$k_d = \log\left(\frac{B(x)}{B(x_{-d})}\right), \tag{6.27}$$


where $B(x_{-d})$ is the Bayes factor excluding observation $x_d$. If
$k_d < 0$, then when observation $x_d$ is deleted there is an increase of
evidence for $M_0$. Consequently, observation $x_d$ itself favors model $M_1$.
The extent to which $x_d$ favors $M_1$ determines whether it can be considered
an outlier under model $M_0$. Similarly, a positive value for $k_d$ implies that
$x_d$ favors $M_0$. Pettit and Young (1990) discuss how $k_d$ can be effectively
used to detect outliers when the prior is non-informative. The same analysis
can be done with informative priors also. This assumes that the conditional
prior densities $g_0$ and $g_1$ can be fully specified. However, we take the
robust Bayesian point of view that only certain broad features of these
densities, such as symmetry and unimodality, can be specified, and hence we can
only state that $g_0$ or $g_1$ belongs to a certain class of densities as
determined by the features specified. Because $k_d$, derived from the Bayes
factor, is the Bayesian quantity of inferential interest here, upper and lower
bounds on $k_d$ over classes of prior densities are required.

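We shall illustrate this approach with a precise null hypothesis. Then we
have the problem of comparing

$$M_0 : \theta = \theta_0 \text{ versus } M_1 : \theta \ne \theta_0$$

using a random sample from a population with density $f(x|\theta)$. Under
$M_1$, suppose $\theta$ has the prior density $g$, $g \in \Gamma$. The Bayes
factors with all the observations and without the $d$th observation,
respectively, are

$$B_g(x) = \frac{f(x|\theta_0)}{\int_{\theta \ne \theta_0} f(x|\theta)g(\theta)\,d\theta}, \qquad
B_g(x_{-d}) = \frac{f(x_{-d}|\theta_0)}{\int_{\theta \ne \theta_0} f(x_{-d}|\theta)g(\theta)\,d\theta}.$$

Because $f(x|\theta) = f(x_d|\theta)f(x_{-d}|\theta)$, we get

$$k_{d,g} = \log\left[\frac{f(x|\theta_0)\int_{\theta \ne \theta_0} f(x_{-d}|\theta)g(\theta)\,d\theta}{f(x_{-d}|\theta_0)\int_{\theta \ne \theta_0} f(x|\theta)g(\theta)\,d\theta}\right]
= \log f(x_d|\theta_0) - \log\frac{\int_{\theta \ne \theta_0} f(x|\theta)g(\theta)\,d\theta}{\int_{\theta \ne \theta_0} f(x_{-d}|\theta)g(\theta)\,d\theta}. \tag{6.28}$$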

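Now note that to find the extreme values of $k_{d,g}$, it is enough to find the
extreme values of

$$h_{d,g} = \frac{\int_{\theta \ne \theta_0} f(x|\theta)g(\theta)\,d\theta}{\int_{\theta \ne \theta_0} f(x_{-d}|\theta)g(\theta)\,d\theta}$$

over the set $\Gamma$. Further, this optimization problem can be rewritten as
follows:

$$\sup_{g \in \Gamma} h_{d,g} = \sup_{g \in \Gamma} \frac{\int_{\theta \ne \theta_0} f(x_d|\theta)f(x_{-d}|\theta)g(\theta)\,d\theta}{\int_{\theta \ne \theta_0} f(x_{-d}|\theta)g(\theta)\,d\theta} \tag{6.29}$$
$$= \sup_{g^* \in \Gamma^*} \int_{\theta \ne \theta_0} f(x_d|\theta)g^*(\theta)\,d\theta, \tag{6.30}$$
$$\inf_{g \in \Gamma} h_{d,g} = \inf_{g \in \Gamma} \frac{\int_{\theta \ne \theta_0} f(x_d|\theta)f(x_{-d}|\theta)g(\theta)\,d\theta}{\int_{\theta \ne \theta_0} f(x_{-d}|\theta)g(\theta)\,d\theta}
= \inf_{g^* \in \Gamma^*} \int_{\theta \ne \theta_0} f(x_d|\theta)g^*(\theta)\,d\theta, \tag{6.31}$$

where

$$\Gamma^* = \left\{ g^* : g^*(\theta) = \frac{f(x_{-d}|\theta)g(\theta)}{\int_{u \ne \theta_0} f(x_{-d}|u)g(u)\,du},\; g \in \Gamma \right\}.$$

Equations (6.30) and (6.31) indicate how optimization of a ratio of integrals
can be transformed to optimization of integrals themselves. Consider the case
where $\Gamma$ is the class $A$ of all prior densities on the set
$\{\theta : \theta \ne \theta_0\}$. Then we have the following result.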


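Theorem 6.21. If $f(x_{-d}|\theta) > 0$ for each $\theta \ne \theta_0$,

$$\sup_{g \in A} h_{d,g} = \sup_{\theta \ne \theta_0} f(x_d|\theta), \text{ and} \tag{6.32}$$
$$\inf_{g \in A} h_{d,g} = \inf_{\theta \ne \theta_0} f(x_d|\theta). \tag{6.33}$$

Proof. From (6.30) and (6.31) above,

$$\sup_{g \in A} h_{d,g} = \sup_{g^* \in A^*} \int_{\theta \ne \theta_0} f(x_d|\theta)g^*(\theta)\,d\theta,$$

where

$$A^* = \left\{ g^* : g^*(\theta) = \frac{f(x_{-d}|\theta)g(\theta)}{\int_{u \ne \theta_0} f(x_{-d}|u)g(u)\,du},\; g \in A \right\}.$$

Now note that extreme points of $A^*$ are point masses. Proof for the infimum
is similar.

The corresponding extreme values of $k_d$ are

$$\sup_{g \in A} k_{d,g} = \log f(x_d|\theta_0) - \log \inf_{\theta \ne \theta_0} f(x_d|\theta), \tag{6.34}$$
$$\inf_{g \in A} k_{d,g} = \log f(x_d|\theta_0) - \log \sup_{\theta \ne \theta_0} f(x_d|\theta). \tag{6.35}$$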


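Example 6.22. Suppose we have a sample of size $n$ from $N(\theta, \sigma^2)$
with known $\sigma^2$. Then, from (6.34) and (6.35),

$$\sup_{g \in A} k_{d,g} = \frac{1}{2\sigma^2}\sup_{\theta \ne \theta_0}\left[(x_d - \theta)^2 - (x_d - \theta_0)^2\right] = \infty, \text{ and}$$
$$\inf_{g \in A} k_{d,g} = \frac{1}{2\sigma^2}\inf_{\theta \ne \theta_0}\left[(x_d - \theta)^2 - (x_d - \theta_0)^2\right] = -\frac{(x_d - \theta_0)^2}{2\sigma^2}.$$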

It can be readily seen from the above bounds on $k_d$ that no observation,
however large in magnitude, will be considered an outlier here. This is because
$A$ is too large a class of prior densities.

Instead, consider the class $G$ of all $N(\theta_0, \tau^2)$ priors with
$\tau^2 \ge \tau_0^2 > 0$. Note that $\tau^2$ close to 0 will make $M_1$
indistinguishable from $M_0$, and hence it is necessary to consider $\tau^2$
bounded away from 0. Then for $g \in G$,

$$h_{d,g} = \frac{\int_{\theta \ne \theta_0} f(x|\theta)g(\theta)\,d\theta}{\int_{\theta \ne \theta_0} f(x_{-d}|\theta)g(\theta)\,d\theta}
= \int_{\theta \ne \theta_0} f(x_d|\theta)g^*(\theta)\,d\theta,$$

where $g^*$ is the density of $N(m, \delta^2)$ with

$$m = m(x_{-d}, \tau^2) = \frac{(n-1)\tau^2}{(n-1)\tau^2 + \sigma^2}\,\bar{x}_{-d} + \frac{\sigma^2}{(n-1)\tau^2 + \sigma^2}\,\theta_0,$$
$$\delta^2 = \delta^2(\tau^2) = \frac{\tau^2\sigma^2}{(n-1)\tau^2 + \sigma^2},$$

and $\bar{x}_{-d}$ is the mean of the observations other than $x_d$.

Note, therefore, that $h_{d,g} = h_{d,g}(\tau^2)$ is just the density of
$N(m, \sigma^2 + \delta^2)$ evaluated at $x_d$. Thus,

$$h_{d,g}(\tau^2) = (2\pi\sigma^2)^{-1/2}\left(1 + \frac{\tau^2}{(n-1)\tau^2 + \sigma^2}\right)^{-1/2}
\exp\left(-\frac{\left(x_d - \frac{(n-1)\tau^2\bar{x}_{-d} + \sigma^2\theta_0}{(n-1)\tau^2 + \sigma^2}\right)^2}{2\sigma^2\left(1 + \frac{\tau^2}{(n-1)\tau^2 + \sigma^2}\right)}\right).$$

For each $x_d$, one just needs to graphically examine the extremes of the
expression above as a function of $\tau^2$ to determine if that particular
observation should be considered an outlier. Delampady (1999) discusses these
results and also results for some larger nonparametric classes of prior
densities.
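The examination over $\tau^2$ can also be done numerically. The following sketch is an illustration under stated assumptions rather than the procedure of Delampady (1999): it computes $k_{d,g}(\tau^2) = \log f(x_d|\theta_0) - \log h_{d,g}(\tau^2)$ from the $N(m, \sigma^2 + \delta^2)$ form above and scans a log-spaced grid of $\tau^2$ values. The function names, the data, and the upper truncation `tau2_max` of the grid are all hypothetical choices.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def k_d(x, d, theta0, sigma2, tau2):
    """k_{d,g} for the prior g = N(theta0, tau2) under M1, known variance sigma2.
    x is a list with at least two observations; d is the index of the
    observation being examined."""
    rest = x[:d] + x[d + 1:]
    n = len(x)
    xbar_rest = sum(rest) / len(rest)
    denom = (n - 1) * tau2 + sigma2
    m = ((n - 1) * tau2 * xbar_rest + sigma2 * theta0) / denom
    delta2 = tau2 * sigma2 / denom
    # h_{d,g} is the N(m, sigma2 + delta2) density evaluated at x_d.
    return log_normal_pdf(x[d], theta0, sigma2) - log_normal_pdf(x[d], m, sigma2 + delta2)

def k_d_extremes(x, d, theta0, sigma2, tau2_min, tau2_max=1e4, grid=400):
    """Approximate inf and sup of k_d over tau2 in [tau2_min, tau2_max]
    on a log-spaced grid (the class G itself has no upper limit on tau2)."""
    ratio = (tau2_max / tau2_min) ** (1.0 / (grid - 1))
    values = [k_d(x, d, theta0, sigma2, tau2_min * ratio ** i) for i in range(grid)]
    return min(values), max(values)

data = [0.1, -0.4, 0.3, 0.2, 4.0]  # hypothetical sample; the last value is suspect
for d in range(len(data)):
    print(d, data[d], k_d_extremes(data, d, theta0=0.0, sigma2=1.0, tau2_min=0.5))
```

An observation whose whole range of $k_d$ values is well below 0 favors $M_1$ for every prior in the (truncated) class, which is the situation in which it would be flagged.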


6.7 Nonsubjective Bayes Factors⁴


Consider two models $M_0$ and $M_1$ for data $X$ with density $f_i(x|\theta_i)$
under model $M_i$, $\theta_i$ being an unknown parameter of dimension $p_i$,
$i = 0, 1$. Given prior specifications $g_i(\theta_i)$ for parameter $\theta_i$,
the Bayes factor of $M_1$ to $M_0$ is obtained as

$$B_{10} = \frac{m_1(x)}{m_0(x)} = \frac{\int f_1(x|\theta_1)g_1(\theta_1)\,d\theta_1}{\int f_0(x|\theta_0)g_0(\theta_0)\,d\theta_0}. \tag{6.36}$$

Here $m_i(x)$ is the marginal density of $X$ under $M_i$, $i = 0, 1$. When
subjective specification of prior distributions is not possible, which is
frequently the case, one would look for an automatic method that uses standard
noninformative priors.


4 Section 6.7 may be omitted at first reading.

There are, however, difficulties with (6.36) for noninformative priors that
are typically improper. If $g_i$ are improper, these are defined only up to
arbitrary multiplicative constants $c_i$; $c_i g_i$ has as much validity as
$g_i$. This implies that $(c_1/c_0)B_{10}$ has as much validity as $B_{10}$.
Thus the Bayes factor is determined only up to an arbitrary multiplicative
constant. This indeterminacy, noted by Jeffreys (1961), has been the main
motivation of new objective methods. We shall confine ourselves mainly to the
nested case where $f_0$ and $f_1$ are of the same functional form and
$f_0(x|\theta_0)$ is the same as $f_1(x|\theta_1)$ with some of the coordinates
of $\theta_1$ specified. However, the methods described below can also be used
for non-nested models.

It may be mentioned here that use of a diffuse (flat) proper prior does not
provide a good solution to the problem. Also, truncation of noninformative
priors leads to a large penalty for the more complex model. An example follows.


Example 6.23. (Testing normal mean with known variance.) Suppose we observe
$X = (X_1, \ldots, X_n)$. Under $M_0$, $X_i$ are i.i.d. $N(0, 1)$ and under
$M_1$, $X_i$ are i.i.d. $N(\theta, 1)$, where $\theta \in R$ is the unknown
mean. With the uniform noninformative prior $g_1^c(\theta) = c$ for $\theta$
under $M_1$, the Bayes factor of $M_1$ to $M_0$ is given by

$$B_{10}^c = \sqrt{2\pi}\,c\,n^{-1/2}\exp[n\bar{X}^2/2].$$

If one uses a uniform prior over $-K \le \theta \le K$, then for large $K$, the
new Bayes factor $B_{10}^K$ is approximately $1/(2Kc)$ times $B_{10}^c$. Thus
for large $K$, one is heavily biased against $M_1$. This is reminiscent of the
phenomenon observed by Lindley (1957). A similar conclusion is obtained if one
uses a diffuse proper prior such as a normal prior $N(0, \tau^2)$, with variance
$\tau^2$ large. The corresponding Bayes factor is

$$B_{10}^{norm} = (n\tau^2 + 1)^{-1/2}\exp\left[\frac{n\tau^2}{n\tau^2 + 1}\,\frac{n}{2}\bar{X}^2\right],$$

which is approximately $(n\tau^2)^{-1/2}\exp[n\bar{X}^2/2]$ for large values of
$n\tau^2$ and hence can be made arbitrarily small by choosing a large value of
$\tau^2$. Also $B_{10}^{norm}$ is highly non-robust with respect to the choice
of $\tau^2$, and this non-robustness plays the same role as indeterminacy. The
expressions for $B_{10}^c$ and $B_{10}^{norm}$ clearly indicate similar behavior
of these Bayes factors and the similar roles of $\sqrt{2\pi}\,c$ and
$(\tau^2 + 1/n)^{-1/2}$.
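The sensitivity to $\tau^2$ is easy to see numerically. The sketch below is illustrative only: the function names and the data summary ($n = 25$, $\bar{X} = 0.5$) are hypothetical. It evaluates the Bayes factor under a $N(0, \tau^2)$ prior for increasing $\tau^2$ and compares it with the uniform-prior Bayes factor in which $\sqrt{2\pi}\,c$ is set equal to $(\tau^2 + 1/n)^{-1/2}$.

```python
import math

def bf_uniform(xbar, n, c=1.0):
    """Bayes factor of M1 to M0 with the improper prior g(theta) = c."""
    return math.sqrt(2 * math.pi) * c / math.sqrt(n) * math.exp(n * xbar ** 2 / 2)

def bf_normal(xbar, n, tau2):
    """Bayes factor of M1 to M0 with a N(0, tau2) prior on theta under M1."""
    weight = n * tau2 / (n * tau2 + 1)
    return (n * tau2 + 1) ** -0.5 * math.exp(weight * n * xbar ** 2 / 2)

n, xbar = 25, 0.5  # hypothetical summary: sqrt(n) * xbar = 2.5
for tau2 in (1, 10, 100, 1e4, 1e6):
    # Choose c so that sqrt(2*pi)*c plays the role of (tau2 + 1/n)^(-1/2).
    c = (tau2 + 1 / n) ** -0.5 / math.sqrt(2 * math.pi)
    print(tau2, bf_normal(xbar, n, tau2), bf_uniform(xbar, n, c))
```

With the data held fixed, the Bayes factor in favor of $M_1$ shrinks roughly like $\tau^{-1}$, so the conclusion can be reversed simply by making the prior more diffuse.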


A solution to the above problem with improper priors is to use part of the
data as a training sample. The data are divided into two parts,
$X = (X_1, X_2)$. The first part $X_1$ is used as a training sample to obtain
proper posterior distributions for the parameters (given $X_1$) starting from
the noninformative priors:

$$g_i(\theta_i|X_1) = \frac{f_i(X_1|\theta_i)g_i(\theta_i)}{\int f_i(X_1|\theta_i)g_i(\theta_i)\,d\theta_i}, \quad i = 0, 1.$$

These proper posteriors are then used as priors to compute the Bayes factor
with the remainder of the data ($X_2$). This conditional Bayes factor,
conditioned on $X_1$, can be expressed as

$$B_{10}(X_1) = \frac{\int f_1(X_2|\theta_1)g_1(\theta_1|X_1)\,d\theta_1}{\int f_0(X_2|\theta_0)g_0(\theta_0|X_1)\,d\theta_0}
= \frac{m_1(X)}{m_0(X)}\,\frac{\int f_0(X_1|\theta_0)g_0(\theta_0)\,d\theta_0}{\int f_1(X_1|\theta_1)g_1(\theta_1)\,d\theta_1}
= B_{10}\,\frac{m_0(X_1)}{m_1(X_1)}, \tag{6.37}$$

where $m_i(X_1)$ is the marginal density of $X_1$ under $M_i$, $i = 0, 1$. Note
that if the priors $c_i g_i$, $i = 0, 1$, are used to compute $B_{10}(X_1)$, the
arbitrary constant multiplier $c_1/c_0$ of $B_{10}$ is cancelled by $(c_0/c_1)$
of $m_0(X_1)/m_1(X_1)$ so that the indeterminacy of the Bayes factor is removed
in (6.37).

A part of the data, $X_1$, may be used as a training sample as described
above if the corresponding posteriors $g_i(\theta_i|X_1)$, $i = 0, 1$ are proper
or, equivalently, the marginal densities $m_i(X_1)$ of $X_1$ under $M_i$,
$i = 0, 1$ are finite. One would naturally use a minimal amount of data as such
a training sample, leaving most of the data for model comparison. As in Berger
and Pericchi (1996a), a training sample $X_1$ may be called proper if
$0 < m_i(X_1) < \infty$, $i = 0, 1$ and minimal if it is proper and no subset of
it is proper.


Example 6.24. (Testing normal mean with known variance.) Consider the
setup of Example 6.23 and the uniform noninformative prior $g_1(\theta) = 1$ for
$\theta$ under $M_1$. The minimal training samples are subsamples of size 1 with
$m_0(X_i) = (1/\sqrt{2\pi})e^{-X_i^2/2}$ and $m_1(X_i) = 1$.


Example 6.25. (Testing normal mean with variance unknown.) Let
$X = (X_1, \ldots, X_n)$,

$M_0 : X_1, \ldots, X_n$ are i.i.d. $N(0, \sigma_0^2)$,

$M_1 : X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma_1^2)$.

Consider the noninformative priors $g_0(\sigma_0) = 1/\sigma_0$ under $M_0$ and
$g_1(\mu, \sigma_1) = 1/\sigma_1$ under $M_1$. Here $m_1(X_i) = \infty$ for a
single observation $X_i$ and a minimal training sample consists of two distinct
observations $X_i, X_j$ and for such a training sample $(X_i, X_j)$,

$$m_0(X_i, X_j) = \frac{1}{2\pi(X_i^2 + X_j^2)} \quad \text{and} \quad m_1(X_i, X_j) = \frac{1}{2|X_i - X_j|}. \tag{6.38}$$


6.7.1 The Intrinsic Bayes Factor 


As described above, a solution to the problem with improper priors is obtained
using a conditional Bayes factor $B_{10}(X_1)$, conditioned on a training
sample $X_1$. However, this conditional Bayes factor depends on the choice of
the training sample $X_1$. Let $X(l)$, $l = 1, 2, \ldots, L$ be the list of all
possible minimal training samples. Berger and Pericchi (1996a) suggest
considering all these minimal training samples and taking an average of the
corresponding $L$ conditional Bayes factors $B_{10}(X(l))$ to obtain what is
called the intrinsic Bayes factor (IBF). For example, taking an arithmetic
average leads to the arithmetic intrinsic Bayes factor (AIBF)

$$AIBF_{10} = \frac{1}{L}\sum_{l=1}^L B_{10}(X(l)) = B_{10}\,\frac{1}{L}\sum_{l=1}^L \frac{m_0(X(l))}{m_1(X(l))} \tag{6.39}$$

and the geometric average gives the geometric intrinsic Bayes factor (GIBF)

$$GIBF_{10} = \left(\prod_{l=1}^L B_{10}(X(l))\right)^{1/L} = B_{10}\left(\prod_{l=1}^L \frac{m_0(X(l))}{m_1(X(l))}\right)^{1/L}, \tag{6.40}$$

the sum and product in (6.39) and (6.40) being taken over the $L$ possible
training samples $X(l)$, $l = 1, \ldots, L$.

Berger and Pericchi (1996a) also suggest using trimmed averages or the
median (complete trimming) of the conditional Bayes factors when taking an
average of all of them does not seem reasonable (e.g., when the conditional
Bayes factors vary much). AIBF and GIBF have good properties but are affected
by outliers. If the sample size is very small, using a part of the sample as a
training sample may be impractical, and Berger and Pericchi (1996a) recommend
using expected intrinsic Bayes factors that replace the averages in (6.39) and
(6.40) by their expectations, evaluated at the MLE under the more complex model
$M_1$. For more details, see Berger and Pericchi (1996a). Situations in which
the IBF reduces simply to the Bayes factor $B_{10}$ with respect to the
noninformative priors are given in Berger et al. (1998). The AIBF is justified
by the possibility of its correspondence with actual Bayes factors with respect
to “reasonable” priors at least asymptotically. Berger and Pericchi (1996a,
2001) have argued that these priors, known as “intrinsic” priors, may be
considered to be natural “default” priors for the testing problems. The
intrinsic priors are discussed here in Subsection 6.7.3.


6.7.2 The Fractional Bayes Factor 


O’Hagan (1994, 1995) proposes a solution using a fractional part of the full
likelihood in place of using parts of the sample as training samples and
averaging over them. The resulting “partial” Bayes factor, called the
fractional Bayes factor (FBF), is given by

$$FBF_{10} = \frac{m_1(X, b)}{m_0(X, b)},$$

where $b$ is a fraction and

$$m_i(X, b) = \frac{\int f_i(X|\theta_i)g_i(\theta_i)\,d\theta_i}{\int [f_i(X|\theta_i)]^b g_i(\theta_i)\,d\theta_i}.$$

Note that $FBF_{10}$ can also be written as

$$FBF_{10} = B_{10}\,\frac{m_0^b(X)}{m_1^b(X)},$$

where

$$m_i^b(X) = \int [f_i(X|\theta_i)]^b g_i(\theta_i)\,d\theta_i, \quad i = 0, 1. \tag{6.41}$$

To make the FBF comparable with the IBF, one may take $b = m/n$ where $m$ is the
size of a minimal training sample as defined above and $n$ is the total sample
size. O’Hagan also recommends other choices of $b$, e.g., $b = \sqrt{n}/n$ or
$\log n/n$.

We now illustrate through a number of examples. 


Example 6.26. (Testing normal mean with known variance.) Consider the
setup of Example 6.23. The Bayes factor with the noninformative prior
$g_1(\theta) = 1$ was obtained as

$$B_{10} = \sqrt{2\pi}\,n^{-1/2}\exp[n\bar{X}^2/2] = \sqrt{2\pi}\,n^{-1/2}\lambda_{10},$$

where $\lambda_{10}$ is the likelihood ratio statistic. The Bayes factor
conditioned on $X_i$ is

$$B_{10}(X_i) = B_{10}\,m_0(X_i)/m_1(X_i) = B_{10}\,(1/\sqrt{2\pi})\exp(-X_i^2/2).$$

Thus

$$AIBF_{10} = n^{-1}\sum_{i=1}^n B_{10}(X_i) = n^{-3/2}\exp(n\bar{X}^2/2)\sum_{i=1}^n \exp(-X_i^2/2),$$
$$GIBF_{10} = n^{-1/2}\exp\left[n\bar{X}^2/2 - (1/2n)\sum_{i=1}^n X_i^2\right].$$

The median IBF (MIBF) is obtained as the median of the set of values
$B_{10}(X_i)$, $i = 1, 2, \ldots, n$.
The FBF with a fraction $0 < b < 1$ is

$$FBF_{10} = b^{1/2}\exp[n(1 - b)\bar{X}^2/2]$$
$$= n^{-1/2}\exp[(n - 1)\bar{X}^2/2], \text{ if } b = 1/n.$$

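Example 6.27. (Testing normal mean with variance unknown.) Consider the
setup of Example 6.25. For the standard noninformative priors considered in
this example, we have

$$B_{10} = \sqrt{\frac{\pi}{n}}\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}\,\frac{\left(\sum_{i=1}^n X_i^2\right)^{n/2}}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^{(n-1)/2}},$$

$$AIBF_{10} = B_{10} \times \frac{1}{\binom{n}{2}}\sum_{i<j}\frac{|X_i - X_j|}{\pi(X_i^2 + X_j^2)},$$

$$FBF_{10} = \frac{\Gamma\left(\frac{n-1}{2}\right)}{\sqrt{\pi}\,\Gamma\left(\frac{n}{2}\right)}\left(\frac{\sum_{i=1}^n X_i^2}{\sum_{i=1}^n (X_i - \bar{X})^2}\right)^{n/2-1} \text{ with } b = \frac{2}{n}.$$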

Example 6.28. (Normal linear model.) This example is from Berger and Pericchi
(1996b, 2001). Berger and Pericchi determined the IBF for linear models for
both the nested and non-nested case. We consider here only the nested case.
Suppose for the data $Y$ ($n \times 1$) we consider the linear models

$$M_i : Y = X_i\beta_i + \epsilon_i, \quad \epsilon_i \sim N_n(0, \sigma_i^2 I_n), \quad i = 0, 1,$$

where $\beta_i = (\beta_{i1}, \beta_{i2}, \ldots, \beta_{ip_i})'$ and
$\sigma_i^2$ are unknown parameters, and $X_i$ is an $n \times p_i$ known design
matrix of rank $p_i < n$. Consider priors of the form

$$g_i(\beta_i, \sigma_i) = \sigma_i^{-(1+q_i)}, \quad q_i > -1.$$

Here $q_i = 0$ gives the reference prior of Berger and Bernardo (1992a), and
$q_i = p_i$ corresponds with the Jeffreys prior. For the nested case, when
$M_0$ is nested in $M_1$, Berger and Pericchi (1996b) consider a modified
Jeffreys prior for which $q_0 = 0$ and $q_1 = p_1 - p_0$. The integrated
likelihoods $m_i(Y)$ with these priors can be obtained as

$$m_i(Y) = C\,2^{q_i/2}\pi^{p_i/2}\,\Gamma((n - p_i + q_i)/2)\,|X_i'X_i|^{-1/2}R_i^{-(n-p_i+q_i)/2},$$

where $C$ is a constant not depending on $i$, and $R_i$ is the residual sum of
squares under $M_i$, $i = 0, 1$. The Bayes factor $B_{10}$ with the modified
Jeffreys prior is then given by

$$B_{10} = (2\pi)^{(p_1-p_0)/2}\,\frac{|X_0'X_0|^{1/2}}{|X_1'X_1|^{1/2}}\left(\frac{R_0}{R_1}\right)^{(n-p_0)/2}. \tag{6.42}$$

Also, one can see that a minimal training sample $Y(l)$ in this case is a sample
of size $m = p_1 + 1$ such that for the corresponding design matrices $X_i(l)$
(under $M_i$), $X_i'(l)X_i(l)$, $i = 0, 1$ are nonsingular. The ratio
$m_0(Y(l))/m_1(Y(l))$ can be obtained from the expression of $B_{10}$ by
inverting it and replacing $n$, $X_0$, $X_1$, $R_0$, and $R_1$ by $m$, $X_0(l)$,
$X_1(l)$, $R_0(l)$, and $R_1(l)$, respectively, where $R_i(l)$ is the residual
sum of squares corresponding to $Y(l)$ under $M_i$, $i = 0, 1$. Thus the
conditional Bayes factor $B_{10}(Y(l))$, conditioned on $Y(l)$, is given by

$$B_{10}(Y(l)) = \frac{|X_0'X_0|^{1/2}}{|X_1'X_1|^{1/2}}\left(\frac{R_0}{R_1}\right)^{(n-p_0)/2}
\cdot \frac{|X_1'(l)X_1(l)|^{1/2}}{|X_0'(l)X_0(l)|^{1/2}}\left(\frac{R_1(l)}{R_0(l)}\right)^{(p_1-p_0+1)/2}.$$

One may now find an average of these conditional Bayes factors to find an
IBF. For example, an arithmetic mean of $B_{10}(Y(l))$ over all possible minimal
training samples $Y(l)$ gives the AIBF, and a median gives the median IBF.

In case of the fractional Bayes factor, one obtains that (see, for example,
Berger and Pericchi, 2001, page 152), with $m_i^b(X)$ as defined in (6.41),

$$\frac{m_0^b(X)}{m_1^b(X)} = \left(\frac{b}{2\pi}\right)^{(p_1-p_0)/2}\frac{|X_1'X_1|^{1/2}}{|X_0'X_0|^{1/2}}\left(\frac{R_1}{R_0}\right)^{(m-p_0)/2}$$

with $b = m/n$ and hence

$$FBF_{10} = b^{(p_1-p_0)/2}\left(R_0/R_1\right)^{(n-m)/2}.$$

See also O’Hagan (1995) in this context.


For more examples, see Berger and Pericchi (1996a, 1996b, 2001) and 
O’Hagan (1995). 

Several other methods have been proposed as solutions to the problem with 
noninformative priors. Smith and Spiegelhalter (1980) and Spiegelhalter and 
Smith (1982) propose the imaginary minimal sample device; see also Ghosh 
and Samanta (2002b) for a generalization. Berger and Pericchi (2001) present 
comparison of four methods including the IBF and FBF with illustration 
through a number of examples. Ghosh and Samanta (2002b) discuss a unified 
derivation of some of the methods that shows that in some qualitative and 
conceptual sense, these methods are close to each other. 


6.7.3 Intrinsic Priors 


Given a default Bayes factor such as the IBF or FBF, a natural question is
whether it corresponds with an actual Bayes factor based on some priors, at
least approximately. If such priors exist, they are called intrinsic priors. A
default Bayes factor such as the IBF can then be calculated as an actual Bayes
factor using the intrinsic prior, and one need not consider all possible
training samples and average over them. A “reasonable” intrinsic prior that
corresponds to a naturally developed good default Bayes factor may be
considered to be a natural default prior for the given testing or model
selection problem. On the other hand, a particular default Bayesian method may
be evaluated on the basis of the corresponding intrinsic prior depending on how
“reasonable” the intrinsic prior is. Berger and Pericchi (1996a) describe how
one can obtain intrinsic priors using an asymptotic argument. We begin with an
example.


Example 6.29. (Example 6.26, continued.) Suppose that for some proper prior
$\pi(\theta)$ under model $M_1$,

$$BF_{10}^\pi \approx AIBF_{10}, \tag{6.43}$$

where $BF_{10}^\pi$ denotes the Bayes factor based on a prior $\pi(\theta)$
under $M_1$. Using Laplace approximation (Section 4.3) to the integrated
likelihood under $M_1$, we have

$$BF_{10}^\pi \approx \frac{f_1(X|\hat{\theta})}{f_0(X|\theta = 0)}\,\pi(\hat{\theta})\left(\frac{2\pi}{n}\right)^{1/2}\hat{I}^{-1/2},$$

where $\hat{\theta}$ is the MLE of $\theta$ under $M_1$, and $\hat{I}$ is the
observed Fisher information number. Thus using the expression for the AIBF in
this example, and noting that $\hat{I} = 1$, (6.43) can be expressed as

$$\pi(\hat{\theta}) \approx (1/\sqrt{2\pi})\,\frac{1}{n}\sum_{i=1}^n \exp(-X_i^2/2).$$

As the RHS converges to
$(1/\sqrt{2\pi})E_\theta(e^{-X_1^2/2}) = (1/\sqrt{2\pi})(1/\sqrt{2})e^{-\theta^2/4}$
with probability one under any $\theta$, this suggests the intrinsic prior

$$\pi(\theta) = \frac{1}{\sqrt{2\pi}\,\sqrt{2}}\exp(-\theta^2/4),$$

which is a $N(0, 2)$ density. One can easily verify that

$$\frac{BF_{10}^\pi}{AIBF_{10}} \to 1$$

with probability one under any $\theta$, i.e., the AIBF is approximately the
same as the Bayes factor with an $N(0, 2)$ prior for $\theta$ under $M_1$.

If one considers the FBF, one can directly show that the FBF, with fraction
$b$, is exactly equal to the Bayes factor with a $N(0, (b^{-1} - 1)/n)$ prior.
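Both statements can be checked numerically. The sketch below is illustrative and rests on assumptions introduced here: the function names, the simulated data (20000 draws from $N(0.5, 1)$ with a fixed seed), and the choice to work on the log scale to avoid overflow. It compares the FBF with the Bayes factor under a $N(0, (b^{-1} - 1)/n)$ prior, which should agree up to rounding, and the AIBF with the Bayes factor under the $N(0, 2)$ intrinsic prior, which should agree only approximately for large $n$.

```python
import math
import random

def log_bf_normal_prior(xbar, n, tau2):
    """Log Bayes factor of M1 to M0 when theta ~ N(0, tau2) under M1 (Example 6.23)."""
    return -0.5 * math.log(n * tau2 + 1) + (n * tau2 / (n * tau2 + 1)) * n * xbar ** 2 / 2

def log_fbf(xbar, n, b):
    """Log of FBF_10 = b^(1/2) exp[n (1 - b) xbar^2 / 2]."""
    return 0.5 * math.log(b) + n * (1 - b) * xbar ** 2 / 2

def log_aibf(x):
    """Log of AIBF_10 = n^(-3/2) exp(n xbar^2 / 2) sum_i exp(-x_i^2 / 2)."""
    n = len(x)
    xbar = sum(x) / n
    return -1.5 * math.log(n) + n * xbar ** 2 / 2 + math.log(sum(math.exp(-xi ** 2 / 2) for xi in x))

# Exact identity for the FBF (hypothetical n, xbar, b).
n, xbar, b = 20, 0.3, 0.05
fbf_gap = log_fbf(xbar, n, b) - log_bf_normal_prior(xbar, n, (1 / b - 1) / n)

# Asymptotic agreement of the AIBF with the N(0, 2) intrinsic prior.
rng = random.Random(0)
sample = [rng.gauss(0.5, 1.0) for _ in range(20000)]
xbar_s = sum(sample) / len(sample)
aibf_ratio = math.exp(log_bf_normal_prior(xbar_s, len(sample), 2.0) - log_aibf(sample))

print("log FBF minus log BF:", fbf_gap)
print("BF with N(0, 2) prior divided by AIBF:", aibf_ratio)
```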


Let us now consider the general case. Let $B_{10}$ be the Bayes factor of
$M_1$ to $M_0$ with noninformative priors $g_i(\theta_i)$ for $\theta_i$ under
$M_i$, $i = 0, 1$. We illustrate below with the AIBF. Treatment for the other
IBFs and FBF will be similar. Recall that

$$AIBF_{10} = B_{10}\bar{B}_{01} \quad \text{where} \quad \bar{B}_{01} = \frac{1}{L}\sum_{l=1}^L \frac{m_0(X(l))}{m_1(X(l))}.$$

Suppose for some priors $\pi_i$ under $M_i$, $i = 0, 1$, $AIBF_{10}$ is
approximately equal to the Bayes factor based on $\pi_0$ and $\pi_1$, denoted
$BF_{10}(\pi_0, \pi_1)$. Using Laplace approximation (Section 4.3) to both the
numerator and denominator of $B_{10}$ (see (6.36)), $AIBF_{10}$ can be shown to
be approximately equal to

$$\frac{f_1(X|\hat{\theta}_1)g_1(\hat{\theta}_1)(2\pi/n)^{p_1/2}|\hat{I}_1|^{-1/2}}{f_0(X|\hat{\theta}_0)g_0(\hat{\theta}_0)(2\pi/n)^{p_0/2}|\hat{I}_0|^{-1/2}} \cdot \bar{B}_{01}, \tag{6.44}$$

where $n$ denotes the sample size, $p_i$ is the dimension of $\theta_i$,
$\hat{\theta}_i$ is the MLE of $\theta_i$, and $\hat{I}_i$ is the observed Fisher
information matrix under $M_i$, $i = 0, 1$. The same approximation applied to
$BF_{10}(\pi_0, \pi_1)$ yields the approximation

$$\frac{f_1(X|\hat{\theta}_1)\pi_1(\hat{\theta}_1)(2\pi/n)^{p_1/2}|\hat{I}_1|^{-1/2}}{f_0(X|\hat{\theta}_0)\pi_0(\hat{\theta}_0)(2\pi/n)^{p_0/2}|\hat{I}_0|^{-1/2}} \tag{6.45}$$

to $BF_{10}(\pi_0, \pi_1)$. We assume that conditions for the Laplace
approximation hold for the given models.

To find the intrinsic priors, we equate (6.44) with (6.45) and this yields

$$\frac{\pi_1(\hat{\theta}_1)g_0(\hat{\theta}_0)}{\pi_0(\hat{\theta}_0)g_1(\hat{\theta}_1)} \approx \bar{B}_{01}. \tag{6.46}$$

Berger and Pericchi (1996a) obtain the intrinsic prior determining equations
by taking limits on both sides of (6.46) under $M_0$ and $M_1$. Assume that, as
$n \to \infty$,

under $M_1$, $\hat{\theta}_1 \to \theta_1$, $\hat{\theta}_0 \to a(\theta_1)$, and $\bar{B}_{01} \to B_1^*(\theta_1)$;

under $M_0$, $\hat{\theta}_0 \to \theta_0$, $\hat{\theta}_1 \to b(\theta_0)$, and $\bar{B}_{01} \to B_0^*(\theta_0)$.

The equations obtained by Berger and Pericchi (1996a) are

$$\frac{\pi_1(\theta_1)g_0(a(\theta_1))}{\pi_0(a(\theta_1))g_1(\theta_1)} = B_1^*(\theta_1) \quad \text{and} \quad \frac{\pi_1(b(\theta_0))g_0(\theta_0)}{\pi_0(\theta_0)g_1(b(\theta_0))} = B_0^*(\theta_0). \tag{6.47}$$

When $M_0$ is nested in $M_1$, Berger and Pericchi suggested the solution

$$\pi_0(\theta_0) = g_0(\theta_0), \quad \pi_1(\theta_1) = g_1(\theta_1)B_1^*(\theta_1). \tag{6.48}$$

However, this may not be the unique solution to (6.47). See also Dmochowski
(1994) in this context.


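Example 6.30. (Example 6.27, continued.) A solution to the intrinsic prior
determining equations suggested by Berger and Pericchi (see (6.48)) is

$$\pi_0(\sigma_0) = \frac{1}{\sigma_0}, \quad \pi_1(\mu, \sigma_1) = \frac{1}{\sigma_1}B_1^*(\mu, \sigma_1), \tag{6.49}$$

where

$$B_1^*(\mu, \sigma_1) = E_{\mu,\sigma_1}B_{01}(X_1, X_2) \quad \text{and} \quad B_{01}(X_1, X_2) = \frac{|X_1 - X_2|}{\pi(X_1^2 + X_2^2)}.$$

Note that $B_{01}(X_1, X_2)$ can be expressed as

$$B_{01}(X_1, X_2) = \frac{Z_1^{1/2}}{(Z_1 + Z_2)}\,\frac{\sqrt{2}}{\pi\sigma_1},$$

where $Z_1 = (X_1 - X_2)^2/(2\sigma_1^2) \sim \chi_1^2$ and
$Z_2 = (X_1 + X_2)^2/(2\sigma_1^2) \sim$ noncentral $\chi^2$ with d.f. = 1 and
noncentrality parameter $\lambda = 2\mu^2/\sigma_1^2$. Also, $Z_1$ and $Z_2$ are
independent. Using the representation of a noncentral $\chi^2$ density as an
(infinite) weighted sum of central $\chi^2$ densities, we have

$$E\left(\frac{Z_1^{1/2}}{Z_1 + Z_2}\right) = \sum_{j=0}^\infty e^{-\lambda/2}\frac{(\lambda/2)^j}{j!}\,E\left(\frac{Z_1^{1/2}}{Z_1 + W_j}\right), \tag{6.50}$$

where $W_j \sim \chi^2_{1+2j}$ and is independent of $Z_1$. We then have

$$B_1^*(\mu, \sigma_1) = \frac{2}{\pi^{3/2}\sigma_1}\,e^{-\lambda/2}\sum_{j=0}^\infty \frac{(\lambda/2)^j}{j!\,(2j+1)}, \quad \lambda = 2\mu^2/\sigma_1^2,$$

and the intrinsic priors are given by

$$\pi_0(\sigma_0) = \frac{1}{\sigma_0}, \quad \pi_1(\mu, \sigma_1) = \frac{1}{\sigma_1}\,\pi_1(\mu|\sigma_1)$$

with

$$\pi_1(\mu|\sigma_1) = \frac{2}{\pi^{3/2}\sigma_1}\exp(-\mu^2/\sigma_1^2)\sum_{j=0}^\infty \frac{(\mu^2/\sigma_1^2)^j}{j!\,(2j+1)}.$$

It is to be noted that $\int_{-\infty}^{\infty} \pi_1(\mu|\sigma_1)\,d\mu = 1$.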


Example 6.31. (Testing normal mean with variance unknown.) This is from
Berger and Pericchi (1996a). Consider the setup of Example 6.25 with the
same prior $g_0$ under $M_0$ but in place of the standard noninformative prior
$g_1(\mu,\sigma_1) = 1/\sigma_1$ use the Jeffreys prior $g_1^J(\mu,\sigma_1) = 1/\sigma_1^2$. In this case, a
minimal training sample consists of two distinct observations $X_i, X_j$ for which

$$m_0(X_i,X_j) = \frac{1}{2\pi(X_i^2 + X_j^2)} \quad\text{and}\quad m_1(X_i,X_j) = \frac{1}{\sqrt{\pi}\,(X_i - X_j)^2}.$$

Proceeding as in the previous example, noting that

$$\frac{m_0(X_1,X_2)}{m_1(X_1,X_2)} = \frac{1}{\sqrt{\pi}}\,\frac{Z_1}{Z_1 + Z_2},$$

where $Z_1$ and $Z_2$ are as above, and using (6.50), the intrinsic priors are
obtained as

$$\pi_0(\sigma_0) = \frac{1}{\sigma_0}, \qquad \pi_1(\mu,\sigma_1) = \frac{1}{\sigma_1}\,\pi_1(\mu|\sigma_1), \qquad \pi_1(\mu|\sigma_1) = \frac{1 - \exp(-\mu^2/\sigma_1^2)}{2\sqrt{\pi}\,(\mu^2/\sigma_1)}.$$

Here $\pi_1(\cdot|\sigma_1)$ is a proper prior, very close to the Cauchy$(0, \sigma_1)$ prior for $\mu$,
which was suggested by Jeffreys (1961) as a default proper prior for $\mu$ (given
$\sigma_1$); see Subsection 2.7.2.
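
The closed form in Example 6.31 can be checked directly from (6.50), applied with $Z_1$ in place of $Z_1^{1/2}$. The sketch below is not part of the original argument; it assumes only the standard fact that, for independent central chi-squares $Z_1 \sim \chi^2_1$ and $W_j \sim \chi^2_{1+2j}$, the ratio $Z_1/(Z_1 + W_j)$ has a Beta$(1/2, j + 1/2)$ distribution with mean $1/(2(j+1))$.

```latex
\begin{align*}
B_1^*(\mu,\sigma_1)
  &= \frac{1}{\sqrt{\pi}}\, E\!\left(\frac{Z_1}{Z_1+Z_2}\right)
   = \frac{1}{\sqrt{\pi}} \sum_{j=0}^{\infty} e^{-\lambda/2}\,\frac{(\lambda/2)^j}{j!}\,
     E\!\left(\frac{Z_1}{Z_1+W_j}\right) \\
  &= \frac{1}{\sqrt{\pi}} \sum_{j=0}^{\infty} e^{-\lambda/2}\,\frac{(\lambda/2)^j}{j!}\,
     \frac{1}{2(j+1)}
   = \frac{1-e^{-\lambda/2}}{2\sqrt{\pi}\,(\lambda/2)},
   \qquad \lambda/2 = \mu^2/\sigma_1^2, \\
\pi_1(\mu,\sigma_1)
  &= \frac{1}{\sigma_1^2}\, B_1^*(\mu,\sigma_1)
   = \frac{1}{\sigma_1}\cdot
     \frac{1-\exp(-\mu^2/\sigma_1^2)}{2\sqrt{\pi}\,(\mu^2/\sigma_1)} .
\end{align*}
```

The second line uses $\sum_{j \ge 0} a^j/\{j!\,(j+1)\} = (e^a - 1)/a$ with $a = \lambda/2$.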


Example 6.32. Consider a negative binomial experiment; Bernoulli trials, each
having probability $\theta$ of success, are independently performed until a total of $n$
successes is accumulated. On the basis of the outcome of this experiment we
want to test the null hypothesis $H_0: \theta = \frac{1}{2}$ against the alternative $H_1: \theta \neq \frac{1}{2}$.

We consider this problem as choosing between the two models $M_0: \theta = \frac{1}{2}$
and $M_1: \theta \in (0, 1)$.
The data may be looked upon as $n$ observations $X_1, \ldots, X_n$ where $X_1$
denotes number of failures before the first success, and for $i = 2, \ldots, n$, $X_i$
denotes number of failures between $(i-1)$th success and $i$th success. The
random variables $X_1, \ldots, X_n$ are i.i.d. with a common geometric distribution
with probability mass function

$$P(X_1 = x) = \theta^x (1 - \theta), \quad x = 0, 1, 2, \ldots.$$

The likelihood function is

$$f(X_1, \ldots, X_n|\theta) = \theta^{\sum X_i} (1 - \theta)^n.$$

We consider the Jeffreys prior

$$g(\theta) = \theta^{-1/2} (1 - \theta)^{-1}, \quad 0 < \theta < 1,$$

which is improper. The Bayes factor with this prior is

$$B_{10} = 2^{\,n + \sum X_i} \int_0^1 \theta^{\sum X_i - 1/2} (1 - \theta)^{n-1}\, d\theta.$$

Minimal training samples are of size 1, and the AIBF is given by

$$AIBF_{10} = B_{10} \times \frac{1}{n} \sum_{i=1}^{n} \frac{2X_i + 1}{2^{X_i + 2}}.$$

Let

$$B^*(\theta) = E_\theta\left(\frac{2X_1 + 1}{2^{X_1 + 2}}\right) = \sum_{x=0}^{\infty} \frac{2x + 1}{2^{x+2}}\,\theta^x (1 - \theta).$$

Then the intrinsic prior is

$$\pi(\theta) = \theta^{-1/2} (1 - \theta)^{-1} B^*(\theta) = \theta^{-1/2} \sum_{x=0}^{\infty} \frac{2x + 1}{2^{x+2}}\,\theta^x.$$

Simplification yields

$$\pi(\theta) = (\theta^{-1/2} + \theta^{1/2}/2)(2 - \theta)^{-2}.$$


We now consider a simple example from Lindley and Phillips (1976), also
presented in Carlin and Louis (1996, Chapter 1). In 12 independent tosses
of a coin, one observes 9 heads and 3 tails, the last toss yielding a tail. It
is shown that one gets different results according to a binomial or a negative
binomial likelihood. Let us consider the problem of testing the null hypothesis
$H_0: \theta = 1/2$ against the alternative $H_1: \theta \neq 1/2$ where $\theta$ denotes the
probability of head in a trial. If a binomial model is assumed, the random
observable $X$ is the number of heads observed in a fixed number of 12 tosses.
One rejects $H_0$ for large values of the statistic $|X - 6|$, and the corresponding
P-value is 0.150. On the other hand, if a negative binomial model is assumed,
the random observable $X$ is the number of heads before the third tail appears.
Note that the expected value of $X$ under $H_0$ is 3. Suppose one rejects $H_0$ for large
values of $|X - 3|$. Then the corresponding P-value is 0.0325. Thus with the
usual 5% Type 1 error level, the two model assumptions lead to different
decisions. Let us now use a Bayes test for this problem. For the binomial
model, the Jeffreys prior is proportional to $\theta^{-1/2}(1 - \theta)^{-1/2}$, which can be
normalized to get a proper prior. For the negative binomial model, the data
can be treated as three i.i.d. geometrically distributed random variables, as
described above. The Bayes factor under the binomial model (with Jeffreys
prior) and the Bayes factor under the negative binomial model (with the
intrinsic prior) are respectively 1.079 and 1.424. They are different as were
the P-values of classical statistics, but unlike the P-values, the Bayes factors
are quite close.
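
The two Bayes factors quoted above can be reproduced numerically. The sketch below is only an illustration: the function names are hypothetical, the quoted values are assumed to be $B_{10}$ (alternative relative to null), and the intrinsic prior is expanded as the series $\theta^{-1/2}\sum_x (2x+1)2^{-(x+2)}\theta^x$ of Example 6.32 so that every term is a Beta integral.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def bf10_binomial_jeffreys(heads, tails):
    """B10 for H0: theta = 1/2 under a binomial likelihood and the normalized
    Jeffreys prior Beta(1/2, 1/2). The binomial coefficient cancels."""
    log_m1 = log_beta(heads + 0.5, tails + 0.5) - log_beta(0.5, 0.5)
    log_m0 = (heads + tails) * math.log(0.5)
    return math.exp(log_m1 - log_m0)


def bf10_negbin_intrinsic(heads, tails, terms=200):
    """B10 under the likelihood theta^heads (1 - theta)^tails and the
    intrinsic prior, summed term by term from its series expansion."""
    log_m0 = (heads + tails) * math.log(0.5)
    total = 0.0
    for x in range(terms):
        # Coefficient (2x + 1) / 2^(x + 2) of theta^(x - 1/2) in the prior.
        log_coef = math.log(2 * x + 1) - (x + 2) * math.log(2.0)
        # Integral of theta^(heads + x - 1/2) (1 - theta)^tails over (0, 1).
        log_integral = log_beta(heads + x + 0.5, tails + 1.0)
        total += math.exp(log_coef + log_integral - log_m0)
    return total


if __name__ == "__main__":
    print(round(bf10_binomial_jeffreys(9, 3), 3))
    print(round(bf10_negbin_intrinsic(9, 3), 3))
```

Here `heads = 9` plays the role of $\sum X_i$ and `tails = 3` the role of $n$ in the geometric likelihood.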


6.8 Exercises 


1. Assume a sharp null and continuity of the null distribution of the test
statistic.
(a) Calculate $E_{H_0}$(P-value) and $E_{H_0}$(P-value | P-value $< \alpha$), where $0 < \alpha < 1$
is the Type 1 error probability.
(b) In view of your answer to (a), do you think 2(P-value) is a better
measure of evidence against $H_0$ than P-value?

2. Suppose $X \sim N(\theta, 1)$ and consider the two hypothesis testing problems:

$$H_0: \theta = -1 \text{ versus } H_1: \theta = 1;$$
$$H_0^*: \theta = 1 \text{ versus } H_1^*: \theta = -1.$$

Find the Bayes factor of $H_0$ relative to $H_1$ and that of $H_0^*$ relative to $H_1^*$
if (a) $x = 0$ is observed, and (b) $x = 1$ is observed. Compute the classical
P-value in both cases.

3. Refer to Example 6.3. Take 7 = 20, but keep the other parameter val- 
ues unchanged. Compute Bo, for the same values of t and n as used in 
Table 6.1. 

4. Suppose $X \sim N(\theta, 1)$ and consider testing

$$H_0: \theta = 0 \text{ versus } H_1: \theta \neq 0.$$

For three different values of $x$, $x = 0, 1, 2$, compute the upper and lower
bounds on Bayes factors when the prior on $\theta$ under the alternative
hypothesis lies in
(a) $\Gamma_A$ = {all prior distributions on $R$},
(b) $\Gamma_N$ = {$N(0, \tau^2)$, $\tau^2 > 0$},
(c) $\Gamma_S$ = {all symmetric (about 0) prior distributions on $R$},
(d) $\Gamma_{SU}$ = {all unimodal priors on $R$, symmetric about 0}.
Compute the classical P-value for each $x$ value. What is the implication
of $\Gamma_N \subset \Gamma_{SU} \subset \Gamma_S \subset \Gamma_A$?

5. Let $X \sim B(m, \theta)$, and let it be of interest to test

$$H_0: \theta = \tfrac{1}{2} \text{ versus } H_1: \theta \neq \tfrac{1}{2}.$$

If $m = 10$ and observed data is $x = 8$, compute the upper and lower
bounds on Bayes factors when the prior on $\theta$ under the alternative
hypothesis lies in
(a) $\Gamma_A$ = {all prior distributions on (0, 1)},
(b) $\Gamma_B$ = {Beta($\alpha$, $\alpha$), $\alpha > 0$},
(c) $\Gamma_S$ = {all symmetric (about $\frac{1}{2}$) priors on (0, 1)},
(d) $\Gamma_{SU}$ = {all unimodal priors on (0, 1), symmetric about $\frac{1}{2}$}.
Compute the classical P-value also.

6. Refer to Example 6.7.
(a) Show that $B(G_A, x) = \exp(-\frac{t^2}{2})$, $P(H_0|G_A, x) = [1 + \frac{1 - \pi_0}{\pi_0} \exp(\frac{t^2}{2})]^{-1}$.
(b) Show that, if $t \leq 1$, $B(G_{US}, x) = 1$, and $P(H_0|G_{US}, x) = \pi_0$.
(c) Show that, if $t \leq 1$, $B(G_{Nor}, x) = 1$, and $P(H_0|G_{Nor}, x) = \pi_0$. If $t > 1$,
$B(G_{Nor}, x) = t \exp(-(t^2 - 1)/2)$.

7. Suppose $X|\theta$ has the $t_p(3, \theta, I_p)$ distribution with density

$$f(x|\theta) \propto \left(1 + \frac{1}{3}(x - \theta)'(x - \theta)\right)^{-(3+p)/2},$$

and it is of interest to test $H_0: \theta = 0$ versus $H_1: \theta \neq 0$. Show that this
testing problem is invariant under the group of all orthogonal transformations.
8. Refer to Example 6.13. Show that the testing problem mentioned there is 
invariant under the group of scale transformations. 
9. In Example 6.16, find the maximal invariants in the sample space and the 
parameter space. 
10. In Example 6.17, find the maximal invariants in the sample space and the 
parameter space. 
11. Let $X|\theta \sim N(\theta, 1)$ and consider testing

$$H_0: |\theta - \theta_0| \leq 0.1 \text{ versus } H_1: |\theta - \theta_0| > 0.1.$$

Suppose $x = \theta_0 + 1.97$ is observed.
(a) Compute the P-value.
(b) Compute $B_{01}$ and $P(H_0|x)$ under the two priors, $N(\theta_0, \tau^2)$, with $\tau^2 =
(0.148)^2$ and $U(\theta_0 - 1, \theta_0 + 1)$.
12. Let $X|p \sim$ Binomial(10, $p$). Consider the two models:

$$M_0: p = \tfrac{1}{2} \text{ versus } M_1: p \neq \tfrac{1}{2}.$$

Under $M_1$, consider the following three priors for $p$: (i) U(0,1), (ii)
Beta(10, 10), and (iii) Beta(100, 10). If five observations, x = 0, 3, 5,
7, and 10 are available, compute kg given in Equation (6.27) for each
observation, and for each of the priors and check which of the observations
may be considered outliers under $M_0$.

13. (Box (1980)) Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta, \sigma^2)$ with
both $\theta$ and $\sigma^2$ unknown. It is of interest to detect discrepancy in the
variance of the model with the target model being

$$M_0: \sigma^2 = \sigma_0^2, \text{ and } \theta \sim N(\mu, \tau^2),$$

where $\mu$ and $\tau^2$ are specified.

(a) Show that the predictive distribution of $(X_1, X_2, \ldots, X_n)$ under $M_0$ is
multivariate normal with covariance matrix $\sigma_0^2 I_n + \tau^2 \mathbf{1}\mathbf{1}'$ and $E(X_i) = \mu$,
for $i = 1, 2, \ldots, n$.

(b) Show that under this predictive distribution,

$$T(X) = \frac{1}{\sigma_0^2} \sum_{i=1}^{n} (X_i - \bar{X})^2 \sim \chi^2_{n-1}.$$


(c) Derive and justify the prior predictive P-value based on the model 
departure statistic T(X). Apply this to data, x = (8,5,4,7), and oĝ = 1, 
= a 2. 

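(d) What is the classical P-value for testing $H_0: \sigma^2 = \sigma_0^2$ in this problem?

14. (Box (1980)) Suppose that under the target model, for $i = 1, 2, \ldots, n$,

$$y_i|\beta_0, \theta, \sigma^2 = \beta_0 + x_i'\theta + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2) \text{ i.i.d.},$$

$$\beta_0|\sigma^2 \sim N(\mu_0, c\sigma^2), \quad \theta|\sigma^2 \sim N_p(\theta_0, \sigma^2\Gamma),$$

$$\sigma^2 \sim \text{inverse Gamma}(\alpha, \gamma),$$

where $c$, $\mu_0$, $\theta_0$, $\Gamma$, $\alpha$ and $\gamma$ are specified. Assume the standard linear
regression model notation of $y = \beta_0\mathbf{1} + X\theta + \epsilon$, and suppose that $X'\mathbf{1} = 0$.
Further assume that, given $\sigma^2$, conditionally $\beta_0$, $\theta$ and $\epsilon$ are independent.
Also, let $\hat\beta_0$ and $\hat\theta$ be the least squares estimates of $\beta_0$ and $\theta$, respectively,
and $RSS = (y - \hat\beta_0\mathbf{1} - X\hat\theta)'(y - \hat\beta_0\mathbf{1} - X\hat\theta)$.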
(a) Show that under the target model, conditionally on 07, the predictive 
density of y is proportional to 





(e exp(— z (Ê + RSS 


+Ê — Bo) (XX) + r7) — 80)) J. 


(b) Prove that the predictive distribution of $y$ under the target model is
a multivariate $t$.
(c) Show that the joint predictive density of $(RSS, \hat\theta)$ is proportional to

$$\left\{2\gamma + RSS + (\hat\theta - \theta_0)'((X'X)^{-1} + \Gamma)^{-1}(\hat\theta - \theta_0)\right\}^{-(n+\alpha-1)/2}.$$

(d) Derive the prior predictive distribution of

$$T(y) = \frac{(\hat\theta - \theta_0)'((X'X)^{-1} + \Gamma)^{-1}(\hat\theta - \theta_0)}{2\gamma + RSS}.$$

(e) Using an appropriately scaled $T(y)$ as the model departure statistic
derive the prior predictive P-value.

15. Consider the same linear regression set-up as in Exercise 14, but let the
target model now be

$$M_0: \theta = 0, \quad \beta_0|\sigma^2 \sim N(\mu_0, c\sigma^2), \quad \sigma^2 \sim \text{inverse Gamma}(\alpha, \gamma).$$

Assuming $\gamma$ to be close to 0, use

$$T(y) = \frac{\hat\theta' X'X \hat\theta}{RSS}$$

as the model departure statistic to derive the prior predictive P-value.
Compare it with the classical P-value for testing $H_0: \theta = 0$.

16. Consider the same problem as in Exercise 15, but let the target model be

$$M_0: \theta = 0, \quad \beta_0|\sigma^2 \sim N(\mu_0, c\sigma^2), \quad \pi(\sigma^2) \propto \frac{1}{\sigma^2}.$$

Using $T(y) = \hat\theta' X'X \hat\theta$ as the model departure statistic and RSS as the
conditioning statistic, derive the conditional predictive P-value. Compute
the partial predictive P-value using the same model departure statistic.
Compare these with the classical P-value for testing $H_0: \theta = 0$.

17. Let $X_1, X_2, \ldots, X_n$ be i.i.d. with density

$$f(x|\lambda, \theta) = \lambda \exp(-\lambda(x - \theta)), \quad x > \theta,$$

where $\lambda > 0$ and $-\infty < \theta < \infty$ are both unknown. Let the target model
be

$$M_0: \theta = 0, \quad \pi(\lambda) \propto \frac{1}{\lambda}.$$

Suppose the smallest order statistic, $T = X_{(1)}$, is considered a suitable
model departure statistic for this problem.
(a) Show that $T|\lambda \sim$ exponential($n\lambda$) under $M_0$.
(b) Show that $\lambda|x_{obs} \sim$ Gamma($n$, $n\bar{x}_{obs}$) under $M_0$.
(c) Show that

$$m(t|x_{obs}) = \frac{n\,\bar{x}_{obs}^{\,n}}{(t + \bar{x}_{obs})^{n+1}}.$$

(d) Compute the posterior predictive P-value.
(e) Show that as $t_{obs} \to \infty$, the posterior predictive P-value does not
necessarily approach 0. (Note that $t_{obs} \leq \bar{x}_{obs} \to \infty$ also.)


18. (Contingency table) Casella and Berger (1990) present the following two-way
table, which is the outcome of a famous medical experiment conducted
by Joseph Lister. Lister performed 75 amputations with and without using
carbolic acid.


Patient|Carbolic Acid Used? 
Lived? | Yes No 





Test for association of patient mortality with the use of carbolic acid on 
the basis of the above data using (a) BIC and (b) the classical likelihood 
ratio test. Discuss the different probabilistic interpretations underlying 
the two tests. 

19. On the basis of the data on food poisoning presented in Table 2.1, you
have to test whether potato salad was the cause. (Do this separately for 
Crab-meat and No Crab-meat). 

(a) Formulate this as a problem of testing a sharp null against the alter- 
native that the null is false. 

(b) Test the sharp null using BIC. 

(c) Test the same null using the classical likelihood ratio test. 

(d) Discuss whether the notions of classical Type 1 and Type 2 error 
probabilities make sense here. 

20. Using the BIC analyze the data of Problem 19 to explore whether crab-meat
also contributed to food poisoning.

21. (Goodness of fit test). Feller (1973) presents the following data on bombing
of London during World War II. The entire area of South London
is divided into 576 small regions of equal area and the number ($n_k$) of
regions with exactly $k$ bomb hits are recorded.

k      0    1    2    3    4    5 and above
n_k    229  211  93   35   7    1

Test the null hypothesis that bombing was at random rather than the
general belief that special targets were being bombed.

(Hint: Under $H_0$ use the Poisson model, under the alternative use the full
multinomial model with 5 parameters and use BIC.)

22. (Hald's regression data). We present below a small set of data on heat
evolved during the hardening of Portland cement and four variables that
may be related to it (Woods et al. (1932), pp. 635-649). The sample
size ($n$) is 13. The regressor variables (in percent of the weight) are $x_1$ =
calcium aluminate (3CaO.Al2O3), $x_2$ = tricalcium silicate (3CaO.SiO2), $x_3$
= tetracalcium alumino ferrite (4CaO.Al2O3.Fe2O3), and $x_4$ = dicalcium
silicate (2CaO.SiO2). The response variable is $y$ = total calories given off
during hardening per gram of cement after 180 days.


Table 6.8. Cement Hardening Data

x1   x2   x3   x4   y
 7   26    6   60    78.6
 1   29   15   52    74.3
11   56    8   20   104.3
11   31    8   47    87.6
 7   52    6   33    95.9
11   55    9   22   109.2
 3   71   17    6   102.7
 1   31   22   44    72.5
 2   54   18   22    93.1
21   47    4   26   115.9
 1   40   23   34    83.8
11   66    9   12   113.3
10   68    8   12   109.4


Usually such a data set is analyzed using normal linear regression model
of the form

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i, \quad i = 1, \ldots, n,$$

where $p$ is the number of regressor variables in the model, $\beta_0, \beta_1, \ldots, \beta_p$ are
unknown parameters, and $\epsilon_i$'s are independent errors having a $N(0, \sigma^2)$
distribution. There are a number of possible models depending on which
regressor variables are kept in the model. Analyze the data and choose
one from this set of possible models using (a) BIC, (b) AIBF of the full
model relative to all possible models.


7


Bayesian Computations 


Bayesian analysis requires computation of expectations and quantiles of prob- 
ability distributions that arise as posterior distributions. Modes of the densi- 
ties of such distributions are also sometimes used. The standard Bayes esti- 
mate is the posterior mean, which is also the Bayes rule under the squared 
error loss. Its accuracy is assessed using the posterior variance, which is again 
an expected value. Posterior median is sometimes utilized, and to provide 
Bayesian credible regions, quantiles of posterior distributions are needed. If 
conjugate priors are not used, as is mostly the case these days, posterior dis- 
tributions will not be standard distributions and hence the required Bayesian 
quantities (i.e., posterior quantities of inferential interest) cannot be computed 
in closed form. Thus special techniques are needed for Bayesian computations. 


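Example 7.1. Suppose $X$ is $N(\theta, \sigma^2)$ with known $\sigma^2$ and a Cauchy$(\mu, \tau)$ prior
on $\theta$ is considered appropriate from robustness considerations (see Chapter 3,
Example 3.20). Then

$$\pi(\theta|x) \propto \exp\left(-(\theta - x)^2/(2\sigma^2)\right)\left(\tau^2 + (\theta - \mu)^2\right)^{-1},$$

and hence the posterior mean and variance are

$$E^\pi(\theta|x) = \frac{\int_{-\infty}^{\infty} \theta \exp\left(-\frac{(\theta - x)^2}{2\sigma^2}\right)\left(\tau^2 + (\theta - \mu)^2\right)^{-1} d\theta}{\int_{-\infty}^{\infty} \exp\left(-\frac{(\theta - x)^2}{2\sigma^2}\right)\left(\tau^2 + (\theta - \mu)^2\right)^{-1} d\theta}, \text{ and}$$

$$V^\pi(\theta|x) = \frac{\int_{-\infty}^{\infty} \theta^2 \exp\left(-\frac{(\theta - x)^2}{2\sigma^2}\right)\left(\tau^2 + (\theta - \mu)^2\right)^{-1} d\theta}{\int_{-\infty}^{\infty} \exp\left(-\frac{(\theta - x)^2}{2\sigma^2}\right)\left(\tau^2 + (\theta - \mu)^2\right)^{-1} d\theta} - \left(E^\pi(\theta|x)\right)^2.$$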





Note that the above integrals cannot be computed in closed form, but 
various numerical integration techniques such as IMSL routines or Gaussian 
quadrature can be efficiently used to obtain very good approximations of these. 
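On the other hand, the following example provides a more difficult problem.

Example 7.2. Suppose $X_1, X_2, \ldots, X_k$ are independent Poisson counts with
$X_i \sim$ Poisson$(\theta_i)$. The $\theta_i$ are a priori considered related, and a joint multivariate
normal prior distribution on their logarithm is assumed. Specifically, let $\nu_i =
\log(\theta_i)$ be the $i$th element of $\nu$ and suppose

$$\nu \sim N_k\left(\mu\mathbf{1}, \tau^2\{(1 - \rho)I_k + \rho\mathbf{1}\mathbf{1}'\}\right),$$

where $\mathbf{1}$ is the $k$-vector with all elements being 1, and $\mu$, $\tau^2$ and $\rho$ are known
constants. Then, because

$$f(x|\nu) = \exp\left(-\sum_{i=1}^{k}\{e^{\nu_i} - \nu_i x_i\}\right) \Big/ \prod_{i=1}^{k} x_i!$$

and

$$\pi(\nu) \propto \exp\left(-\frac{1}{2\tau^2}(\nu - \mu\mathbf{1})'\left((1 - \rho)I_k + \rho\mathbf{1}\mathbf{1}'\right)^{-1}(\nu - \mu\mathbf{1})\right),$$

we have that

$$\pi(\nu|x) \propto \exp\left\{-\sum_{i=1}^{k}\{e^{\nu_i} - \nu_i x_i\} - \frac{1}{2\tau^2}(\nu - \mu\mathbf{1})'\left((1 - \rho)I_k + \rho\mathbf{1}\mathbf{1}'\right)^{-1}(\nu - \mu\mathbf{1})\right\}.$$

Therefore, if the posterior mean of $\theta_j$ is of interest, we need to compute

$$E^\pi(\theta_j|x) = E^\pi(\exp(\nu_j)|x) = \frac{\int_{R^k} \exp(\nu_j)\,g(\nu|x)\,d\nu}{\int_{R^k} g(\nu|x)\,d\nu},$$

where

$$g(\nu|x) = \exp\left\{-\sum_{i=1}^{k}\{e^{\nu_i} - \nu_i x_i\} - \frac{1}{2\tau^2}(\nu - \mu\mathbf{1})'\left((1 - \rho)I_k + \rho\mathbf{1}\mathbf{1}'\right)^{-1}(\nu - \mu\mathbf{1})\right\}.$$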


This is a ratio of two k-dimensional integrals, and as k grows, the integrals 
become less and less easy to work with. Numerical integration techniques fail 
to be an efficient technique in this case. This problem, known as the curse 
of dimensionality, is due to the fact that the size of the part of the space 
that is not relevant for the computation of the integral grows very fast with 
the dimension. Consequently, the error in approximation associated with this 
numerical method increases as the power of the dimension k, making the 
technique inefficient. In fact, numerical integration techniques are presently 
not preferred except for single and two-dimensional integrals. 


The recent popularity of Bayesian approach to statistical applications is 
mainly due to advances in statistical computing. These include the E-M algo- 
rithm discussed in Section 7.2 and the Markov chain Monte Carlo (MCMC) 
sampling techniques that are discussed in Section 7.4. As we see later, Bayesian 
analysis of real-life problems invariably involves difficult computations while 
MCMC techniques such as Gibbs sampling (Section 7.4.4) and Metropolis- 
Hastings algorithm (M-H) (Section 7.4.3) have rendered some of these very 
difficult computational tasks quite feasible.


7.1 Analytic Approximation


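This is exactly what we saw in Section 4.3.2 where we derived analytic large
sample approximations for certain integrals using the Laplace approximation.
Specifically, suppose

$$E^\pi(g(\theta)|x) = \frac{\int g(\theta) f(x|\theta)\pi(\theta)\,d\theta}{\int f(x|\theta)\pi(\theta)\,d\theta} \tag{7.1}$$

is the Bayesian quantity of interest where $g$, $f$, and $\pi$ are smooth functions of
$\theta$.

First, consider any integral of the form

$$I = \int_{R^k} q(\theta)\exp(-nh(\theta))\,d\theta,$$

where $h$ is a smooth function with $-h$ having its unique maximum at $\hat\theta$.
Then, as indicated in Section 4.3.1 for the univariate case, the Laplace method
involves expanding $q$ and $h$ about $\hat\theta$ in a Taylor series. Let $h'$ and $q'$ denote the
vectors of partial derivatives of $h$ and $q$, respectively, and $\Delta_h$ and $\Delta_q$ denote
the Hessians of $h$ and $q$. Then writing

$$h(\theta) = h(\hat\theta) + (\theta - \hat\theta)'h'(\hat\theta) + \tfrac{1}{2}(\theta - \hat\theta)'\Delta_h(\hat\theta)(\theta - \hat\theta) + \cdots$$
$$\phantom{h(\theta)} = h(\hat\theta) + \tfrac{1}{2}(\theta - \hat\theta)'\Delta_h(\hat\theta)(\theta - \hat\theta) + \cdots \text{ and}$$
$$q(\theta) = q(\hat\theta) + (\theta - \hat\theta)'q'(\hat\theta) + \tfrac{1}{2}(\theta - \hat\theta)'\Delta_q(\hat\theta)(\theta - \hat\theta) + \cdots,$$

we obtain

$$I = \int_{R^k}\left\{q(\hat\theta) + (\theta - \hat\theta)'q'(\hat\theta) + \tfrac{1}{2}(\theta - \hat\theta)'\Delta_q(\hat\theta)(\theta - \hat\theta) + \cdots\right\} e^{-nh(\hat\theta)}\exp\left(-\tfrac{n}{2}(\theta - \hat\theta)'\Delta_h(\hat\theta)(\theta - \hat\theta) + \cdots\right) d\theta$$
$$\phantom{I} = e^{-nh(\hat\theta)}(2\pi)^{k/2} n^{-k/2}\,|\Delta_h(\hat\theta)|^{-1/2}\left\{q(\hat\theta) + O(n^{-1})\right\}, \tag{7.2}$$

which is exactly (4.16). Upon applying this to both the numerator and
denominator of (7.1) separately (with $q$ equal to $g$ and 1), a first-order approximation

$$E^\pi(g(\theta)|x) = g(\hat\theta)\left\{1 + O(n^{-1})\right\}$$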


easily emerges. It also indicates that a second-order approximation may be 
available if further terms in the Taylor series expansion are retained in the 
approximation. 

Suppose that $g$ in (7.1) is positive, and let $-nh(\theta) = \log f(x|\theta) + \log \pi(\theta)$,
$-nh^*(\theta) = -nh(\theta) + \log g(\theta)$. Now apply (7.2) to both the numerator and
denominator of (7.1) with $q$ equal to 1. Then, letting $\theta^*$ denote the maximum
of $-h^*$, $\Sigma = \Delta_h^{-1}(\hat\theta)$, $\Sigma^* = \Delta_{h^*}^{-1}(\theta^*)$, as mentioned in Section 4.3.2, Tierney
and Kadane (1986) obtain the fantastic approximation

$$E^\pi(g(\theta)|x) = \frac{|\Sigma^*|^{1/2}\exp(-nh^*(\theta^*))}{|\Sigma|^{1/2}\exp(-nh(\hat\theta))}\left\{1 + O(n^{-2})\right\}, \tag{7.3}$$

which they call fully exponential. This technique can be used in Example 7.2.
Note that to derive the approximation in (7.3), it is enough to have the prob- 
ability distribution of g(@) concentrate away from the origin on the positive 
side. Therefore, often when g is non-positive, (7.3) can be applied after adding 
a large positive constant to g, and this constant is to be subtracted after 
obtaining the approximation. Some other analytic approximations are also 
available. Angers and Delampady (1997) use an exponential approximation 
for a probability distribution that concentrates near the origin. We will not 
be emphasizing any of these techniques here, including the many numerical
integration methods mentioned previously, because the availability of powerful
and all-purpose simulation methods has rendered them less powerful.
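
To see how (7.3) behaves in a case where the exact answer is known, the following sketch applies it to a one-dimensional toy problem. Everything here is an assumption made for illustration: the unnormalized posterior is taken to be $\theta^{a-1}(1-\theta)^{b-1}$ with $a, b > 1$, $g(\theta) = \theta$, and the function name is hypothetical. The exact posterior mean is then $a/(a+b)$, and the first-order approximation is simply the posterior mode.

```python
import math


def laplace_posterior_mean_beta(a, b):
    """Return (first-order, fully exponential) approximations to E(theta)
    when f(x|theta) * pi(theta) is proportional to theta^(a-1) (1-theta)^(b-1)."""

    def log_kernel(theta, p, q):
        # log of theta^p (1 - theta)^q
        return p * math.log(theta) + q * math.log(1.0 - theta)

    def mode_and_curvature(p, q):
        # Maximizer of theta^p (1 - theta)^q and minus the second
        # derivative of its logarithm at the maximizer.
        mode = p / (p + q)
        curvature = p / mode ** 2 + q / (1.0 - mode) ** 2
        return mode, curvature

    # Denominator of (7.1): -n h(theta) = (a-1) log theta + (b-1) log(1-theta).
    mode, curv = mode_and_curvature(a - 1.0, b - 1.0)
    # Numerator of (7.1): -n h*(theta) = -n h(theta) + log theta.
    mode_star, curv_star = mode_and_curvature(a, b - 1.0)

    log_ratio = (log_kernel(mode_star, a, b - 1.0)
                 - log_kernel(mode, a - 1.0, b - 1.0))
    # |Sigma*|^(1/2) / |Sigma|^(1/2) is sqrt(curv / curv_star) in one dimension.
    fully_exponential = math.sqrt(curv / curv_star) * math.exp(log_ratio)
    return mode, fully_exponential


if __name__ == "__main__":
    first_order, tk = laplace_posterior_mean_beta(10.0, 6.0)
    print(first_order, tk, 10.0 / 16.0)
```

In this toy case the fully exponential value sits noticeably closer to the exact mean than the mode does, in line with the error orders $O(n^{-2})$ and $O(n^{-1})$.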


7.2 The E-M Algorithm 


We shall use a slightly different notation here. Suppose $Y|\theta$ has density
$f(y|\theta)$, and suppose the prior on $\theta$ is $\pi(\theta)$, resulting in the posterior
density $\pi(\theta|y)$. When $\pi(\theta|y)$ is computationally difficult to handle, as is usually
the case, there are some 'data augmentation' methods that can help. The
idea is to augment the observed data $y$ with missing or latent data $z$ to
obtain the 'complete' data $x = (y, z)$ so that the augmented posterior density
$\pi(\theta|x) = \pi(\theta|y, z)$ is computationally easy to handle. The E-M algorithm (see
Dempster et al. (1977), Tanner (1991), or McLachlan and Krishnan (1997))
is the simplest among such data augmentation methods. In our context, the
E-M algorithm is meant for computing the posterior mode. However, if data
augmentation yields a computationally simple posterior distribution, there
are more powerful computational tools available that can provide a lot more
information on the posterior distribution as will be seen later in this chapter.
The basic steps in the iterations of the E-M algorithm are the following.

Let $p(z|y, \hat\theta)$ be the predictive density of $Z$ given $y$ and an estimate $\hat\theta$ of $\theta$.
Find $z^{(i)} = E(Z|y, \hat\theta^{(i)})$, where $\hat\theta^{(i)}$ is the estimate of $\theta$ used at the $i$th step of
the iteration. Note the similarity with estimating missing values. Use $z^{(i)}$ to
augment $y$ and maximize $\pi(\theta|y, z^{(i)})$ to obtain $\hat\theta^{(i+1)}$. Then find $z^{(i+1)}$ using
$\hat\theta^{(i+1)}$ and continue this iteration. This combination of expectation followed
by maximization in each iteration gives its name to the E-M algorithm.


Implementation of the E-M Algorithm

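Note that because $\pi(\theta|y) = \pi(\theta, z|y)/p(z|y, \theta)$, we have that

$$\log \pi(\theta|y) = \log \pi(\theta, z|y) - \log p(z|y, \theta).$$

Taking expectation with respect to $Z|\hat\theta^{(i)}, y$ on both sides, we get

$$\log \pi(\theta|y) = \int \log \pi(\theta, z|y)\,p(z|y, \hat\theta^{(i)})\,dz - \int \log p(z|y, \theta)\,p(z|y, \hat\theta^{(i)})\,dz$$
$$\phantom{\log \pi(\theta|y)} = Q(\theta, \hat\theta^{(i)}) - H(\theta, \hat\theta^{(i)}) \tag{7.4}$$

(where $Q$ and $H$ are according to the notation of Dempster et al. (1977)).
Then, the general E-M algorithm involves the following two steps in the $i$th
iteration:

E-Step: Calculate $Q(\theta, \hat\theta^{(i)})$;
M-Step: Maximize $Q(\theta, \hat\theta^{(i)})$ with respect to $\theta$ and obtain $\hat\theta^{(i+1)}$ such that

$$\max_\theta Q(\theta, \hat\theta^{(i)}) = Q(\hat\theta^{(i+1)}, \hat\theta^{(i)}).$$

Note that

$$\log \pi(\hat\theta^{(i+1)}|y) - \log \pi(\hat\theta^{(i)}|y) = \left\{Q(\hat\theta^{(i+1)}, \hat\theta^{(i)}) - Q(\hat\theta^{(i)}, \hat\theta^{(i)})\right\} - \left\{H(\hat\theta^{(i+1)}, \hat\theta^{(i)}) - H(\hat\theta^{(i)}, \hat\theta^{(i)})\right\}.$$

From the E-M algorithm, we have that $Q(\hat\theta^{(i+1)}, \hat\theta^{(i)}) \geq Q(\hat\theta^{(i)}, \hat\theta^{(i)})$. Further,
for any $\theta$,

$$H(\theta, \hat\theta^{(i)}) - H(\hat\theta^{(i)}, \hat\theta^{(i)}) = \int \log p(z|y, \theta)\,p(z|y, \hat\theta^{(i)})\,dz - \int \log p(z|y, \hat\theta^{(i)})\,p(z|y, \hat\theta^{(i)})\,dz$$
$$= \int \log\left[\frac{p(z|y, \theta)}{p(z|y, \hat\theta^{(i)})}\right] p(z|y, \hat\theta^{(i)})\,dz$$
$$= -\int \log\left[\frac{p(z|y, \hat\theta^{(i)})}{p(z|y, \theta)}\right] p(z|y, \hat\theta^{(i)})\,dz$$
$$\leq 0,$$

because, for any two densities $p_1$ and $p_2$, $\int \log(p_1(x)/p_2(x))\,p_1(x)\,dx$ is the
Kullback-Leibler distance between $p_1$ and $p_2$, which is at least 0. Therefore,

$$H(\hat\theta^{(i+1)}, \hat\theta^{(i)}) - H(\hat\theta^{(i)}, \hat\theta^{(i)}) \leq 0,$$

and hence

$$\pi(\hat\theta^{(i+1)}|y) \geq \pi(\hat\theta^{(i)}|y)$$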


for any iteration $i$. Therefore, starting from any point, the E-M algorithm can
usually be expected to converge to a local maximum.

Example 7.3. (genetic linkage model.) Consider the data from Rao (1973) on
a certain recombination rate in genetics (see Sorensen and Gianola (2002) for
details). Here 197 counts are classified into 4 categories as shown in Table 7.1,
along with the corresponding theoretical cell probabilities.


Table 7.1. Genetic Linkage Data

Cell Count     Probability
$y_1 = 125$    $\frac{1}{2} + \frac{\theta}{4}$
$y_2 = 18$     $\frac{1}{4}(1 - \theta)$
$y_3 = 20$     $\frac{1}{4}(1 - \theta)$
$y_4 = 34$     $\frac{\theta}{4}$


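The multinomial mass function in this example is given by $f(y|\theta) \propto (2 +
\theta)^{y_1}(1 - \theta)^{y_2 + y_3}\theta^{y_4}$, so that under the uniform(0,1) prior on $\theta$, the observed
posterior density is given by

$$\pi(\theta|y) \propto (2 + \theta)^{y_1}(1 - \theta)^{y_2 + y_3}\theta^{y_4}.$$

This is not a standard density due to the presence of $2 + \theta$. If we split the first
cell into two with probabilities 1/2 and $\theta/4$, respectively, the complete data
will be given by $x = (x_1, x_2, x_3, x_4, x_5)$, where $x_1 + x_2 = y_1$, $x_3 = y_2$, $x_4 = y_3$
and $x_5 = y_4$. The augmented posterior density will then be given by

$$\pi(\theta|x) \propto \theta^{x_2 + x_5}(1 - \theta)^{x_3 + x_4},$$

which corresponds with the Beta density.
The E-step of E-M consists of obtaining

$$Q(\theta, \hat\theta^{(i)}) = E\left[(X_2 + X_5)\log\theta + (X_3 + X_4)\log(1 - \theta)\,\big|\,y, \hat\theta^{(i)}\right]$$
$$\phantom{Q(\theta, \hat\theta^{(i)})} = \left\{E\left[X_2|y, \hat\theta^{(i)}\right] + y_4\right\}\log\theta + (y_2 + y_3)\log(1 - \theta). \tag{7.5}$$

The M-step involves finding $\hat\theta^{(i+1)}$ to maximize (7.5). We can do this by solving
$\frac{\partial}{\partial\theta}Q(\theta, \hat\theta^{(i)}) = 0$, so that

$$\hat\theta^{(i+1)} = \frac{E\left[X_2|y, \hat\theta^{(i)}\right] + y_4}{E\left[X_2|y, \hat\theta^{(i)}\right] + y_4 + y_2 + y_3}. \tag{7.6}$$

Now note that $E\left[X_2|y, \hat\theta^{(i)}\right] = E\left[X_2|X_1 + X_2, \hat\theta^{(i)}\right]$, and that

$$X_2|X_1 + X_2, \hat\theta^{(i)} \sim \text{binomial}\left(X_1 + X_2, \frac{\hat\theta^{(i)}/4}{1/2 + \hat\theta^{(i)}/4}\right).$$

Therefore,

$$E\left[X_2|X_1 + X_2 = y_1, \hat\theta^{(i)}\right] = y_1\,\frac{\hat\theta^{(i)}}{2 + \hat\theta^{(i)}},$$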

and hence

$$\hat\theta^{(i+1)} = \frac{\dfrac{y_1\hat\theta^{(i)}}{2 + \hat\theta^{(i)}} + y_4}{\dfrac{y_1\hat\theta^{(i)}}{2 + \hat\theta^{(i)}} + y_2 + y_3 + y_4}. \tag{7.7}$$


Table 7.2. E-M Iterations for Genetic Linkage Data Example

Iteration $i$    $\hat\theta^{(i)}$
1                .60825
2                .62432
3                .62648
4                .62678
5                .62682
6                .62682


In our example, (7.7) converges to $\hat\theta$ = .62682 in 5 iterations starting from
$\hat\theta^{(0)}$ = .5 as shown in Table 7.2.
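
The recursion (7.7) is short enough to run directly. The sketch below is an illustration rather than the authors' code; the function name is hypothetical, and the cell counts (125, 18, 20, 34) are assumed to be those of Rao's data (they sum to 197 and reproduce the iterates of Table 7.2).

```python
def em_genetic_linkage(y, theta0=0.5, iterations=6):
    """Run the E-M recursion (7.7) and return the successive estimates."""
    y1, y2, y3, y4 = y
    theta = theta0
    path = []
    for _ in range(iterations):
        # E-step: expected count in the theta/4 part of the split first cell.
        x2 = y1 * theta / (2.0 + theta)
        # M-step: mode of the Beta-type augmented posterior, as in (7.6).
        theta = (x2 + y4) / (x2 + y2 + y3 + y4)
        path.append(theta)
    return path


if __name__ == "__main__":
    for i, value in enumerate(em_genetic_linkage((125, 18, 20, 34)), start=1):
        print(i, round(value, 5))
```

The limit of the recursion is the positive root of $197\theta^2 - 15\theta - 68 = 0$, which gives a quick independent check on the iterates.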


7.3 Monte Carlo Sampling 


Consider an expectation that is not available in closed form. An alternative to 
numerical integration or analytic approximation to compute this is statistical 
sampling. This probabilistic technique is a familiar tool in statistical infer- 
ence. To estimate a population mean or a population proportion, a natural 
approach is to gather a large sample from this population and to consider 
the corresponding sample mean or the sample proportion. The law of large 
numbers guarantees that the estimates so obtained will be good provided the 
sample is large enough. Specifically, let $f$ be a probability density function (or
a mass function) and suppose the quantity of interest is a finite expectation
of the form

$$E_f h(X) = \int h(x) f(x)\,dx \tag{7.8}$$

(or the corresponding sum in the discrete case). If i.i.d. observations $X_1, X_2, \ldots$
can be generated from the density $f$, then

$$\bar{h}_m = \frac{1}{m}\sum_{i=1}^{m} h(X_i) \tag{7.9}$$

converges in probability (or even almost surely) to $E_f h(X)$. This justifies
using $\bar{h}_m$ as an approximation for $E_f h(X)$ for large $m$. To provide a measure
of accuracy or the extent of error in the approximation, we can again use a
statistical technique and compute the standard error. If $Var_f h(X)$ is finite,
then $Var_f(\bar{h}_m) = Var_f h(X)/m$. Further, $Var_f h(X) = E_f h^2(X) - (E_f h(X))^2$
can be estimated by

$$s_m^2 = \frac{1}{m}\sum_{i=1}^{m}\left(h(X_i) - \bar{h}_m\right)^2,$$

and hence the standard error of $\bar{h}_m$ can be estimated by

$$\frac{1}{\sqrt{m}}\,s_m = \frac{1}{m}\left(\sum_{i=1}^{m}\left(h(X_i) - \bar{h}_m\right)^2\right)^{1/2}.$$

If one wishes, confidence intervals for $E_f h(X)$ can also be provided using the
central limit theorem. Because

$$\frac{\sqrt{m}\left(\bar{h}_m - E_f h(X)\right)}{s_m} \to N(0, 1) \text{ as } m \to \infty$$

in distribution, $(\bar{h}_m - z_{\alpha/2}s_m/\sqrt{m},\ \bar{h}_m + z_{\alpha/2}s_m/\sqrt{m})$ can be used as an
approximate $100(1 - \alpha)\%$ confidence interval for $E_f h(X)$, with $z_{\alpha/2}$ denoting
the $100(1 - \alpha/2)\%$ quantile of standard normal.
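
The estimate (7.9), its standard error, and the approximate confidence interval can be computed together in a few lines. This is a minimal sketch with hypothetical names; the target $E_f X^2 = 1$ for $f = N(0,1)$ is chosen only because the answer is known, and $z = 1.96$ corresponds to $\alpha = 0.05$.

```python
import math
import random


def monte_carlo(h, sampler, m, z=1.96):
    """Return (estimate, standard error, approximate interval) for E h(X),
    where sampler() returns one draw from the density f."""
    values = [h(sampler()) for _ in range(m)]
    h_bar = sum(values) / m
    # s_m^2 with divisor m, as in the text.
    s2 = sum((v - h_bar) ** 2 for v in values) / m
    se = math.sqrt(s2 / m)
    return h_bar, se, (h_bar - z * se, h_bar + z * se)


if __name__ == "__main__":
    rng = random.Random(0)
    estimate, se, interval = monte_carlo(
        lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 100_000)
    print(estimate, se, interval)
```

Halving the standard error requires roughly four times as many draws, since it decreases like $1/\sqrt{m}$.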

The above discussion suggests that if we want to approximate the posterior 
mean, we could try to generate i.i.d. observations from the posterior distribu- 
tion and consider the mean of this sample. This is rarely useful because most 
often the posterior distribution will be a non-standard distribution which may 
not easily allow sampling from it. Note that there are other possibilities as 
seen below. 


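Example 7.4. (Example 7.1 continued.) Recall that

$$E^\pi(\theta|x) = \frac{\int_{-\infty}^{\infty} \theta \exp\left(-\frac{(\theta - x)^2}{2\sigma^2}\right)\left(\tau^2 + (\theta - \mu)^2\right)^{-1} d\theta}{\int_{-\infty}^{\infty} \exp\left(-\frac{(\theta - x)^2}{2\sigma^2}\right)\left(\tau^2 + (\theta - \mu)^2\right)^{-1} d\theta}$$

$$\phantom{E^\pi(\theta|x)} = \frac{\int_{-\infty}^{\infty} \theta \left\{\frac{1}{\sigma}\phi\left(\frac{\theta - x}{\sigma}\right)\right\}\left(\tau^2 + (\theta - \mu)^2\right)^{-1} d\theta}{\int_{-\infty}^{\infty} \left\{\frac{1}{\sigma}\phi\left(\frac{\theta - x}{\sigma}\right)\right\}\left(\tau^2 + (\theta - \mu)^2\right)^{-1} d\theta},$$

where $\phi$ denotes the density of standard normal. Thus $E^\pi(\theta|x)$ is the ratio of
expectation of $h(\theta) = \theta/(\tau^2 + (\theta - \mu)^2)$ to that of $h(\theta) = 1/(\tau^2 + (\theta - \mu)^2)$,
both expectations being with respect to the $N(x, \sigma^2)$ distribution. Therefore,
we simply sample $\theta_1, \theta_2, \ldots$ from $N(x, \sigma^2)$ and use

$$\widehat{E^\pi}(\theta|x) = \frac{\sum_{i=1}^{m} \theta_i\left(\tau^2 + (\theta_i - \mu)^2\right)^{-1}}{\sum_{i=1}^{m} \left(\tau^2 + (\theta_i - \mu)^2\right)^{-1}}$$

as our Monte Carlo estimate of $E^\pi(\theta|x)$. Note that (7.8) and (7.9) are applied
separately to both the numerator and denominator, but using the same sample
of $\theta$'s.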

It is unwise to assume that the problem has been completely solved. The
sample of $\theta$'s generated from $N(x, \sigma^2)$ will tend to concentrate around $x$,
whereas to satisfactorily account for the contribution of the Cauchy prior to
the posterior mean, a significant portion of the $\theta$'s should come from the tails
of the posterior distribution. It may therefore appear that it is perhaps better
to express the posterior mean in the form

$$E^\pi(\theta|x) = \frac{\int_{-\infty}^{\infty} \theta \exp\left(-\frac{(\theta - x)^2}{2\sigma^2}\right)\pi(\theta)\,d\theta}{\int_{-\infty}^{\infty} \exp\left(-\frac{(\theta - x)^2}{2\sigma^2}\right)\pi(\theta)\,d\theta},$$

then sample $\theta$'s from Cauchy$(\mu, \tau)$ and use the approximation

$$\widehat{E^\pi}(\theta|x) = \frac{\sum_{i=1}^{m} \theta_i \exp\left(-\frac{(\theta_i - x)^2}{2\sigma^2}\right)}{\sum_{i=1}^{m} \exp\left(-\frac{(\theta_i - x)^2}{2\sigma^2}\right)}.$$

However, this is also not totally satisfactory because the tails of the posterior
distribution are not as heavy as those of the Cauchy prior, and hence there
will be excess sampling from the tails relative to the center. The implication
is that the convergence of the approximation is slower and hence a larger
error in approximation (for a fixed $m$). Ideally, therefore, sampling should be
from the posterior distribution itself for a satisfactory approximation. With
this view in mind, a variation of the above theme has been developed. This is
called the Monte Carlo importance sampling.
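
The two sampling schemes of Example 7.4 can be compared side by side. The following sketch is illustrative only: the function names, the parameter values used in the demonstration, and the midpoint-rule reference value are assumptions, not part of the example.

```python
import math
import random


def posterior_mean_normal_sampling(x, sigma, mu, tau, m, rng):
    """Sample theta from N(x, sigma^2); weight by the Cauchy prior kernel."""
    num = den = 0.0
    for _ in range(m):
        theta = rng.gauss(x, sigma)
        w = 1.0 / (tau ** 2 + (theta - mu) ** 2)
        num += theta * w
        den += w
    return num / den


def posterior_mean_cauchy_sampling(x, sigma, mu, tau, m, rng):
    """Sample theta from Cauchy(mu, tau); weight by the normal likelihood."""
    num = den = 0.0
    for _ in range(m):
        # Inverse-c.d.f. draw from the Cauchy(mu, tau) distribution.
        theta = mu + tau * math.tan(math.pi * (rng.random() - 0.5))
        w = math.exp(-((theta - x) ** 2) / (2.0 * sigma ** 2))
        num += theta * w
        den += w
    return num / den


def posterior_mean_quadrature(x, sigma, mu, tau, half_width=12.0, steps=20_000):
    """Midpoint-rule reference value on x +/- half_width * sigma."""
    lo = x - half_width * sigma
    step = 2.0 * half_width * sigma / steps
    num = den = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * step
        k = (math.exp(-((theta - x) ** 2) / (2.0 * sigma ** 2))
             / (tau ** 2 + (theta - mu) ** 2))
        num += theta * k
        den += k
    return num / den


if __name__ == "__main__":
    rng = random.Random(0)
    args = (2.0, 1.0, 0.0, 1.0)  # x, sigma, mu, tau (illustrative values)
    print(posterior_mean_quadrature(*args))
    print(posterior_mean_normal_sampling(*args, 100_000, rng))
    print(posterior_mean_cauchy_sampling(*args, 100_000, rng))
```

Repeating the run with different seeds gives a rough feel for how the variability of the two estimators differs for a fixed $m$.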





Consider (7.8) again. Suppose that it is difficult or expensive to sample
directly from $f$, but there exists a probability density $u$ that is very close to
$f$ from which it is easy to sample. Then we can rewrite (7.8) as

$$E_f h(X) = \int h(x) f(x)\,dx$$
$$\phantom{E_f h(X)} = \int h(x)\,\frac{f(x)}{u(x)}\,u(x)\,dx$$
$$\phantom{E_f h(X)} = \int \{h(x)w(x)\}\,u(x)\,dx$$
$$\phantom{E_f h(X)} = E_u\{h(X)w(X)\},$$

where $w(x) = f(x)/u(x)$. Now apply (7.9) with $f$ replaced by $u$ and $h$ replaced
by $hw$. In other words, generate i.i.d. observations $X_1, X_2, \ldots$ from the density
$u$ and compute

$$\overline{hw}_m = \frac{1}{m}\sum_{i=1}^{m} h(X_i)w(X_i).$$

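The sampling density $u$ is called the importance function. We illustrate
importance sampling with the following example.

Example 7.5. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. $N(\theta, \sigma^2)$, where both $\theta$ and
$\sigma^2$ are unknown. Independent priors are assumed for $\theta$ and $\sigma^2$, where $\theta$ has a
double exponential distribution with density $\exp(-|\theta|)/2$ and $\sigma^2$ has the prior
density of $(1 + \sigma^2)^{-2}$. Neither of these is a standard prior, but each is a robust choice
of proper prior all the same. If the posterior mean of $\theta$ is of interest, then it
is necessary to compute

$$E^\pi(\theta|x) = \int_{-\infty}^{\infty}\int_0^{\infty} \theta\,\pi(\theta, \sigma^2|x)\,d\sigma^2\,d\theta.$$

Because $\pi(\theta, \sigma^2|x)$ is not a standard density, let us look for a standard density
close to it. Letting $\bar{x}$ denote the mean of the sample $x_1, x_2, \ldots, x_n$ and $s_n^2 =
\sum_i (x_i - \bar{x})^2/n$, note that

$$\pi(\theta, \sigma^2|x) \propto (\sigma^2)^{-n/2}\exp\left(-\frac{n}{2\sigma^2}\{(\theta - \bar{x})^2 + s_n^2\}\right)\exp(-|\theta|)(1 + \sigma^2)^{-2}$$

$$= \left\{\left[s_n^2 + (\theta - \bar{x})^2\right]^{n/2+1}(\sigma^2)^{-(n/2+2)}\exp\left(-\frac{n}{2\sigma^2}\{(\theta - \bar{x})^2 + s_n^2\}\right)\right\}$$
$$\quad\times \left[s_n^2 + (\theta - \bar{x})^2\right]^{-(n/2+1)}\exp(-|\theta|)\left(\frac{\sigma^2}{1 + \sigma^2}\right)^2$$

$$\propto u_1(\sigma^2|\theta)\,u_2(\theta)\exp(-|\theta|)\left(\frac{\sigma^2}{1 + \sigma^2}\right)^2,$$

where $u_1(\sigma^2|\theta)$ is the density of inverse Gamma with shape parameter $n/2 + 1$
and scale parameter $\frac{n}{2}\{(\theta - \bar{x})^2 + s_n^2\}$, and $u_2$ is the Student's $t$ density with
d.f. $n + 1$, location $\bar{x}$ and scale a multiple of $s_n$. It may be noted that the
tails of $\exp(-|\theta|)\left(\frac{\sigma^2}{1 + \sigma^2}\right)^2$ do not have much of an influence in the presence of
$u_1(\sigma^2|\theta)u_2(\theta)$. Therefore, $u(\theta, \sigma^2) = u_1(\sigma^2|\theta)u_2(\theta)$ may be chosen as a suitable
importance function. This involves sampling $\theta$ first from the density $u_2(\theta)$, and
given this $\theta$, sampling $\sigma^2$ from $u_1(\sigma^2|\theta)$. This is repeated to generate further
values of $(\theta, \sigma^2)$. Finally, after generating $m$ of these pairs $(\theta_i, \sigma_i^2)$, the required
posterior mean of $\theta$ is approximated by

$$\widehat{E^\pi}(\theta|x) = \frac{\sum_{i=1}^{m} \theta_i\,w(\theta_i, \sigma_i^2)}{\sum_{i=1}^{m} w(\theta_i, \sigma_i^2)},$$

where $w(\theta, \sigma^2) = f(x|\theta, \sigma^2)\pi(\theta, \sigma^2)/u(\theta, \sigma^2)$.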
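
The sampling scheme of Example 7.5 can be sketched with the standard library alone. The code below is an illustration under stated assumptions: the function name and the data used in the demonstration are hypothetical, the Student's $t$ scale is taken to be $s_n/\sqrt{n+1}$ (the multiple of $s_n$ that makes $u_2$ proportional to $[s_n^2 + (\theta - \bar{x})^2]^{-(n+2)/2}$), and the weight is used only up to a constant, which cancels in the ratio.

```python
import math
import random


def posterior_mean_example_7_5(data, m, rng):
    """Self-normalized importance sampling estimate of E(theta | x)."""
    n = len(data)
    x_bar = sum(data) / n
    s2 = sum((xi - x_bar) ** 2 for xi in data) / n
    df = n + 1
    t_scale = math.sqrt(s2 / df)
    num = den = 0.0
    for _ in range(m):
        # theta ~ u2: Student's t with n + 1 d.f., location x_bar.
        chi2 = rng.gammavariate(df / 2.0, 2.0)
        theta = x_bar + t_scale * rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)
        # sigma^2 | theta ~ u1: inverse Gamma with shape n/2 + 1 and
        # scale (n/2) * {(theta - x_bar)^2 + s2}.
        scale = 0.5 * n * ((theta - x_bar) ** 2 + s2)
        sigma2 = scale / rng.gammavariate(n / 2.0 + 1.0, 1.0)
        # f * pi / u reduces, up to a constant, to the leftover factor.
        w = math.exp(-abs(theta)) * (sigma2 / (1.0 + sigma2)) ** 2
        num += theta * w
        den += w
    return num / den


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [1.2, 0.4, 1.9, 0.8, 1.1, 0.3, 1.6, 0.9, 1.4, 0.4]
    print(posterior_mean_example_7_5(sample, 50_000, rng))
```

For data with a positive sample mean, the estimate should fall slightly below $\bar{x}$, reflecting the pull of the double exponential prior toward 0.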


In some high-dimensional problems, a combination of numerical integra- 
tion, Laplace approximation and Monte Carlo sampling seems to give ap- 
pealing results. Delampady et al. (1993) use a Laplace-type approximation to 
obtain a suitable importance function in a high-dimensional problem. 

One area that we have not touched upon is how to generate random deviates
from a given probability distribution. Clearly, this is a very important
subject, being the basis of any Monte Carlo sampling technique. Instead of
providing a sketchy discussion from this vast area, we refer the reader to
the excellent book by Robert and Casella (1999). We would, however, like to
mention one recent and very important development in this area. This is the
discovery of a very efficient algorithm to generate a sequence of uniform
random deviates with a very big period of $2^{19937} - 1$. This algorithm, known as
the Mersenne twister (MT), has many other desirable features as well, details
on which may be found in Matsumoto and Nishimura (1998). The property of
having a very large period is especially important because Monte Carlo
simulation methods, especially MCMC, require very long sequences of random
deviates for proper implementation.


7.4 Markov Chain Monte Carlo Methods 


7.4.1 Introduction 


A severe drawback of the standard Monte Carlo sampling or Monte Carlo 
importance sampling is that complete determination of the functional form 
of the posterior density is needed for their implementation. Situations where 
posterior distributions are incompletely specified or are specified indirectly 
cannot be handled. One such instance is where the joint posterior distribu- 
tion of the vector of parameters is specified in terms of several conditional 
and marginal distributions, but not directly. This actually covers a very large 
range of Bayesian analysis because a lot of Bayesian modeling is hierarchical 
so that the joint posterior is difficult to calculate but the conditional
posteriors given parameters at different levels of hierarchy are easier to write down
(and hence sample from). For instance, consider the normal-Cauchy problem 
of Example 7.1. As shown later in Section 7.4.6, this problem can be given a 
hierarchical structure wherein we have the normal model, the conjugate nor- 
mal prior in the first stage with a hyperparameter for its variance and this 
hyperparameter again has the conjugate prior. Similarly, consider Example 7.2 
where we have independent observations $X_i \sim \text{Poisson}(\theta_i)$. Now suppose the
prior on the $\theta_i$'s is a conjugate mixture. We again see (Problem 14) that
a hierarchical prior structure can lead to analytically tractable conditional 
posteriors. It turns out that it is indeed possible in such cases to adopt an 
iterative Monte Carlo sampling scheme, which at the point of convergence will 
guarantee a random draw from the target joint posterior distribution. These 
iterative Monte Carlo procedures typically generate a random sequence with 
the Markov property such that this Markov chain is ergodic with the lim- 
iting distribution being the target posterior distribution. There is actually a 
whole class of such iterative procedures collectively called Markov chain Monte 
Carlo (MCMC) procedures. Different procedures from this class are suitable 
for different situations. 

As mentioned above, convergence of a random sequence with the Markov
property is being utilized in this procedure, and hence some basic
understanding of Markov chains is required. This material is presented below. This
discussion as well as the following sections are mainly based on Athreya et al.
(2003).


7.4.2 Markov Chains in MCMC 


A sequence of random variables $\{X_n\}_{n \ge 0}$ is a Markov chain if for any $n$, given
the current value, $X_n$, the past $\{X_j, j \le n-1\}$ and the future $\{X_j : j \ge n+1\}$
are independent. In other words,

\[ P(A \cap B|X_n) = P(A|X_n)P(B|X_n), \tag{7.10} \]


where $A$ and $B$ are events defined respectively in terms of the past and the
future. Among Markov chains there is a subclass that has wide applicability.
They are Markov chains with time homogeneous or stationary transition
probabilities, meaning that the probability distribution of $X_{n+1}$ given $X_n = x$, and
the past, $X_j : j \le n-1$, depends only on $x$ and does not depend on the values
of $X_j : j \le n-1$ or $n$. If the set $S$ of values $\{X_n\}$ can take, known as the
state space, is countable, this reduces to specifying the transition probability
matrix $P = ((p_{ij}))$ where for any two values $i, j$ in $S$, $p_{ij}$ is the probability
that $X_{n+1} = j$ given $X_n = i$, i.e., of moving from state $i$ to state $j$ in one time
unit. For state space $S$ that is not countable, one has to specify a transition
kernel or transition function $P(x, \cdot)$ where $P(x, A)$ is the probability of moving
from $x$ into $A$ in one step, i.e., $P(X_{n+1} \in A|X_n = x)$. Given the transition
probability and the probability distribution of the initial value $X_0$, one can
construct the joint probability distribution of $\{X_j : 0 \le j \le n\}$ for any finite
$n$. For example, in the countable state space case

\[
\begin{aligned}
P(X_0 = i_0, X_1 = i_1, \ldots, X_{n-1} = i_{n-1}, X_n = i_n)
&= P(X_n = i_n|X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) \\
&\qquad \times P(X_0 = i_0, X_1 = i_1, \ldots, X_{n-1} = i_{n-1}) \\
&= p_{i_{n-1} i_n} P(X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) \\
&= P(X_0 = i_0)\, p_{i_0 i_1} p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.
\end{aligned}
\]

A probability distribution $\pi$ is called stationary or invariant for a transition
probability $P$ or the associated Markov chain $\{X_n\}$ if it is the case that
when the probability distribution of $X_0$ is $\pi$ then the same is true for $X_n$ for
all $n \ge 1$. Thus in the countable state space case a probability distribution
$\pi = \{\pi_i : i \in S\}$ is stationary for a transition probability matrix $P$ if for each
$j$ in $S$,

\[ P(X_1 = j) = \sum_i P(X_1 = j|X_0 = i)P(X_0 = i) = \sum_i \pi_i p_{ij} = P(X_0 = j) = \pi_j. \tag{7.11} \]

In vector notation it says $\pi = (\pi_1, \pi_2, \ldots)$ is a left eigenvector of the matrix
$P$ with eigenvalue 1 and

\[ \pi = \pi P. \tag{7.12} \]

Similarly, if $S$ is a continuum, a probability distribution $\pi$ with density $p(x)$
is stationary for the transition kernel $P(\cdot, \cdot)$ if

\[ \pi(A) = \int_A p(x)\, dx = \int_S P(x, A)\, p(x)\, dx \]

for all $A \subset S$.
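As a small numerical illustration of (7.11) and (7.12), the following Python sketch finds a stationary distribution of a finite transition matrix by repeatedly applying $\pi \leftarrow \pi P$. The three-state matrix and the function name are our own hypothetical choices, not taken from the text; the iteration is assumed to converge, which holds for this irreducible aperiodic example.

```python
def stationary_distribution(P, iters=10_000, tol=1e-13):
    """Iterate pi <- pi P from the uniform distribution until it stops changing."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        new = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

# Hypothetical 3-state chain; its stationary distribution is (1/4, 1/2, 1/4).
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
pi = stationary_distribution(P)
```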

A Markov chain $\{X_n\}$ with a countable state space $S$ and transition
probability matrix $P = ((p_{ij}))$ is said to be irreducible if for any two states $i$ and $j$
the probability of the Markov chain visiting $j$ starting from $i$ is positive, i.e.,
for some $n \ge 1$, $p_{ij}^{(n)} \equiv P(X_n = j|X_0 = i) > 0$. A similar notion of
irreducibility, known as Harris or Doeblin irreducibility, exists for the general state space
case also. For details on this somewhat advanced notion as well as other results
that we state here without proof, see Robert and Casella (1999) or Meyn and
Tweedie (1993). In addition, Tierney (1994) and Athreya et al. (1996) may be
used as more advanced references on irreducibility and MCMC. In particular, 
the last reference uses the fact that there is a stationary distribution of the 
Markov chain, namely, the joint posterior, and thus provides better and very 
explicit conditions for the MCMC to converge. 


Theorem 7.6. (law of large numbers for Markov chains) Let $\{X_n\}_{n \ge 0}$
be a Markov chain with a countable state space $S$ and a transition probability
matrix $P$. Further, suppose it is irreducible and has a stationary probability
distribution $\pi \equiv (\pi_i : i \in S)$ as defined in (7.11). Then, for any bounded
function $h : S \to R$ and for any initial distribution of $X_0$

\[ \frac{1}{n} \sum_{i=0}^{n-1} h(X_i) \to \sum_j h(j)\pi_j \tag{7.13} \]

in probability as $n \to \infty$.


A similar law of large numbers (LLN) holds when the state space $S$ is not
countable. The limit value in (7.13) will be the integral of $h$ with respect to the
stationary distribution $\pi$. A sufficient condition for the validity of this LLN
is that the Markov chain $\{X_n\}$ be Harris irreducible and have a stationary
distribution $\pi$.

To see how this is useful to us, consider the following. Given a probability
distribution $\pi$ on a set $S$, and a function $h$ on $S$, suppose it is desired to
compute the "integral of $h$ with respect to $\pi$", which reduces to $\sum_j h(j)\pi_j$
in the countable case. Look for an irreducible Markov chain $\{X_n\}$ with state
space $S$ and stationary distribution $\pi$. Then, starting from some initial value
$X_0$, run the Markov chain $\{X_j\}$ for a period of time, say $0, 1, 2, \ldots, n-1$, and
consider as an estimate

\[ \mu_n = \frac{1}{n} \sum_{j=0}^{n-1} h(X_j). \tag{7.14} \]

By the LLN (7.13), this estimate $\mu_n$ will be close to $\sum_j h(j)\pi_j$ for large $n$.
This technique is called Markov chain Monte Carlo (MCMC). For example,
if one is interested in $\pi(A) = \sum_{j \in A} \pi_j$ for some $A \subset S$ then by LLN (7.13)
this reduces to

\[ \pi_n(A) \equiv \frac{1}{n} \sum_{j=0}^{n-1} I_A(X_j) \to \pi(A) \]

in probability as $n \to \infty$, where $I_A(X_j) = 1$ if $X_j \in A$ and 0 otherwise.
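The estimate (7.14) is easy to try out on a toy chain. The sketch below is a hypothetical illustration (the three-state matrix, whose stationary distribution is $(1/4, 1/2, 1/4)$, and the function name are ours): it runs a single chain and averages $h(X_j)$ along the path, once with $h(j) = j$ and once with the indicator of $A = \{0\}$.

```python
import random

# Hypothetical 3-state transition matrix with stationary distribution (1/4, 1/2, 1/4).
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

def ergodic_average(P, h, n, x0=0, seed=0):
    """mu_n = (1/n) * sum_{j=0}^{n-1} h(X_j) along one run of the chain, as in (7.14)."""
    rng = random.Random(seed)
    states = range(len(P))
    x, total = x0, 0.0
    for _ in range(n):
        total += h(x)
        x = rng.choices(states, weights=P[x])[0]   # one step from row x of P
    return total / n

mu_n = ergodic_average(P, h=lambda j: float(j), n=50_000)                  # target: sum_j j * pi_j = 1
pi_n_A = ergodic_average(P, h=lambda j: 1.0 if j == 0 else 0.0, n=50_000)  # target: pi({0}) = 1/4
```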

An irreducible Markov chain $\{X_n\}$ with a countable state space $S$ is called
aperiodic if for some $i \in S$ the greatest common divisor, g.c.d. $\{n : p_{ii}^{(n)} > 0\} = 1$.
Then, in addition to the LLN (7.13), the following result on the convergence
of $P(X_n = j)$ holds.

\[ \sum_j |P(X_n = j) - \pi_j| \to 0 \tag{7.15} \]

as $n \to \infty$, for any initial distribution of $X_0$. In other words, for large $n$ the
probability distribution of $X_n$ will be close to $\pi$. There exists a result similar
to (7.15) for the general state space case that asserts that under suitable
conditions, the probability distribution of $X_n$ will be close to $\pi$ as $n \to \infty$.

This suggests that instead of doing one run of length $n$, one could do $N$
independent runs each of length $m$ so that $n = Nm$ and then from the $i$th
run use only the $m$th observation, say, $X_{m,i}$, and consider the estimate

\[ \tilde{\mu}_{N,m} \equiv \frac{1}{N} \sum_{i=1}^{N} h(X_{m,i}). \tag{7.16} \]


Other variations exist as well. Some of the special Markov chains used in 
MCMC are discussed in the next two sections. 


7.4.3 Metropolis-Hastings Algorithm 


In this section, we discuss a very general MCMC method with wide applica- 
tions. It will soon become clear why this important discovery has led to very 
considerable progress in simulation-based inference, particularly in Bayesian 
analysis. The idea here is not to directly simulate from the given target den- 
sity (which may be computationally very difficult) at all, but to simulate an 
easy Markov chain that has this target density as the density of its stationary 
distribution. We begin with a somewhat abstract setting but very soon will
get to practical implementation.

Let $S$ be a finite or countable set. Let $\pi$ be a probability distribution on
$S$. We shall call $\pi$ the target distribution. (There is room for slight confusion
here because in our applications the target distribution will always be the
posterior distribution, so let us note that $\pi$ here does not denote the prior
distribution, but just a standard notation for a generic target.) Let $Q = ((q_{ij}))$
be a transition probability matrix such that for each $i$, it is computationally
easy to generate a sample from the distribution $\{q_{ij} : j \in S\}$. Let us generate
a Markov chain $\{X_n\}$ as follows. If $X_n = i$, first sample from the distribution
$\{q_{ij} : j \in S\}$ and denote that observation $Y_n$. Then, choose $X_{n+1}$ from the
two values $X_n$ and $Y_n$ according to

\[
\begin{aligned}
P(X_{n+1} = Y_n|X_n, Y_n) &= \rho(X_n, Y_n), \\
P(X_{n+1} = X_n|X_n, Y_n) &= 1 - \rho(X_n, Y_n),
\end{aligned}
\tag{7.17}
\]

where the "acceptance probability" $\rho(\cdot, \cdot)$ is given by

\[ \rho(i, j) = \min\left\{ \frac{\pi_j q_{ji}}{\pi_i q_{ij}}, 1 \right\} \tag{7.18} \]

for all $(i, j)$ such that $\pi_i q_{ij} > 0$. Note that $\{X_n\}$ is a Markov chain with
transition probability matrix $P = ((p_{ij}))$ given by

\[
p_{ij} =
\begin{cases}
q_{ij}\rho_{ij}, & j \ne i, \\
1 - \sum_{k \ne i} p_{ik}, & j = i.
\end{cases}
\tag{7.19}
\]


$Q$ is called the "proposal transition probability" and $\rho$ the "acceptance
probability". A significant feature of this transition mechanism $P$ is that $P$ and $\pi$
satisfy

\[ \pi_i p_{ij} = \pi_j p_{ji} \quad \text{for all } i, j. \tag{7.20} \]

This implies that for any $j$

\[ \sum_i \pi_i p_{ij} = \pi_j \sum_i p_{ji} = \pi_j, \tag{7.21} \]

or, $\pi$ is a stationary probability distribution for $P$.
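The construction (7.18) and (7.19) and the balance relation (7.20) can be checked numerically on a small state space. The following Python sketch is a hypothetical illustration: the four-point target, given only up to a normalizing constant, and the uniform proposal matrix are our own choices.

```python
def mh_transition_matrix(pi, Q):
    """Build P from (7.19): p_ij = q_ij * rho(i, j) for j != i, with rho from (7.18)."""
    k = len(pi)
    P = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if j != i and pi[i] * Q[i][j] > 0:
                rho = min(pi[j] * Q[j][i] / (pi[i] * Q[i][j]), 1.0)
                P[i][j] = Q[i][j] * rho
        P[i][i] = 1.0 - sum(P[i][j] for j in range(k) if j != i)
    return P

pi = [1.0, 2.0, 3.0, 4.0]                  # unnormalized target weights (hypothetical)
Q = [[0.25] * 4 for _ in range(4)]         # uniform proposal on the four states
P = mh_transition_matrix(pi, Q)

# Largest violation of detailed balance (7.20); zero up to rounding error.
balance_gap = max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
                  for i in range(4) for j in range(4))
```

Only the ratios $\pi_j/\pi_i$ enter the computation, so the weights need not sum to one.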

Now assume that $S$ is irreducible with respect to $Q$ and $\pi_i > 0$ for all $i$ in
$S$. It can then be shown that $P$ is irreducible, and because it has a stationary
distribution $\pi$, LLN (7.13) is available. This algorithm is thus a very flexible
and useful one. The choice of $Q$ is subject only to the condition that $S$ is
irreducible with respect to $Q$. Clearly, it is no loss of generality to assume
that $\pi_i > 0$ for all $i$ in $S$. A sufficient condition for the aperiodicity of $P$ is
that $p_{ii} > 0$ for some $i$ or equivalently

\[ \sum_{j \ne i} q_{ij}\rho_{ij} < 1. \]

A sufficient condition for this is that there exists a pair $(i, j)$ such that $\pi_i q_{ij} > 0$
and $\pi_j q_{ji} < \pi_i q_{ij}$.

Recall that if $P$ is aperiodic, then both the LLN (7.13) and (7.15) hold.
If $S$ is not finite or countable but is a continuum and the target distribution
$\pi(\cdot)$ has a density $p(\cdot)$, then one proceeds as follows: Let $Q$ be a transition
function such that for each $x$, $Q(x, \cdot)$ has a density $q(x, y)$. Then proceed as
in the discrete case but set the "acceptance probability" $\rho(x, y)$ to be

\[ \rho(x, y) = \min\left\{ \frac{p(y)q(y, x)}{p(x)q(x, y)}, 1 \right\} \]

for all $(x, y)$ such that $p(x)q(x, y) > 0$. A particularly useful feature of the
above algorithm is that it is enough to know $p(\cdot)$ up to a multiplicative constant
as in the definition of the "acceptance probability" $\rho(\cdot, \cdot)$, only the ratios
$p(y)/p(x)$ need to be calculated. (In the discrete case, it is enough to know $\{\pi_i\}$
up to a multiplicative constant because the "acceptance probability" $\rho(\cdot, \cdot)$
needs only the ratios $\pi_i/\pi_j$.) This assures us that in Bayesian applications
it is not necessary to have the normalizing constant of the posterior density
available for computation of the posterior quantities of interest.
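For a continuous state space, a minimal Python sketch of the algorithm is given below. It is not the only possible implementation: we assume a symmetric Gaussian random-walk proposal, so that $q(x, y) = q(y, x)$ cancels in $\rho(x, y)$, and the target, function name and tuning constants are hypothetical. The target is supplied only through an unnormalized log density.

```python
import math
import random

def metropolis_hastings(log_p, x0, n, step=1.0, seed=0):
    """Random-walk M-H; log_p is the log of the target density up to an additive constant."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n):
        y = x + rng.gauss(0.0, step)                     # symmetric proposal q(x, y) = q(y, x)
        rho = math.exp(min(0.0, log_p(y) - log_p(x)))    # min{p(y)/p(x), 1}
        if rng.random() < rho:
            x = y                                        # accept the proposal
        chain.append(x)                                  # otherwise the chain stays at x
    return chain

# Hypothetical target: N(2, 1) with its normalizing constant deliberately dropped.
chain = metropolis_hastings(lambda t: -0.5 * (t - 2.0) ** 2, x0=0.0, n=50_000, step=2.0)
kept = chain[1000:]                                      # discard an initial stretch
post_mean = sum(kept) / len(kept)
```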


7.4.4 Gibbs Sampling 


As was pointed out in Chapter 2, most of the new problems that Bayesians are 
asked to solve are high-dimensional. Applications to areas such as micro-arrays 
and image processing are some examples. Bayesian analysis of such problems
invariably involves target (posterior) distributions that are high-dimensional
multivariate distributions. In image processing, for example, typically one has
an $N \times N$ square grid of pixels with $N = 256$ and each pixel has $k \ge 2$ possible
values. Thus each configuration has $(256)^2$ components and the state space $S$
has $k^{(256)^2}$ configurations. To simulate a random configuration from a target
distribution over such a large S is not an easy task. The Gibbs sampler is a 
technique especially suitable for generating an irreducible aperiodic Markov 
chain that has as its stationary distribution a target distribution in a high- 
dimensional space but having some special structure. The most interesting 
aspect of this technique is that to run this Markov chain, it suffices to generate 
observations from univariate distributions. 

The Gibbs sampler in the context of a bivariate probability distribution
can be described as follows. Let $\pi$ be a target probability distribution of a
bivariate random vector $(X, Y)$. For each $x$, let $P(x, \cdot)$ be the conditional
probability distribution of $Y$ given $X = x$. Similarly, let $Q(y, \cdot)$ be the
conditional probability distribution of $X$ given $Y = y$. Note that for each $x$,
$P(x, \cdot)$ is a univariate distribution, and for each $y$, $Q(y, \cdot)$ is also a univariate
distribution. Now generate a bivariate Markov chain $Z_n = (X_n, Y_n)$ as follows:

Start with some $X_0 = x_0$. Generate an observation $Y_0$ from the distribution
$P(x_0, \cdot)$. Then generate an observation $X_1$ from $Q(Y_0, \cdot)$. Next generate an
observation $Y_1$ from $P(X_1, \cdot)$ and so on. At stage $n$ if $Z_n = (X_n, Y_n)$ is known,
then generate $X_{n+1}$ from $Q(Y_n, \cdot)$ and $Y_{n+1}$ from $P(X_{n+1}, \cdot)$.

If $\pi$ is a discrete distribution concentrated on $\{(x_i, y_j) : 1 \le i \le K, 1 \le
j \le L\}$ and if $\pi_{ij} = \pi(x_i, y_j)$ then $P(x_i, y_j) = \pi_{ij}/\pi_{i\cdot}$ and

\[ Q(y_j, x_i) = \frac{\pi_{ij}}{\pi_{\cdot j}}, \]

where $\pi_{i\cdot} = \sum_j \pi_{ij}$, $\pi_{\cdot j} = \sum_i \pi_{ij}$. Thus the transition probability matrix
$R = ((r_{(ij),(k\ell)}))$ for the $\{Z_n\}$ chain is given by

\[ r_{(ij),(k\ell)} = Q(y_j, x_k)P(x_k, y_\ell) = \frac{\pi_{kj}}{\pi_{\cdot j}} \frac{\pi_{k\ell}}{\pi_{k\cdot}}. \]

It can be verified that this chain is irreducible, aperiodic, and has $\pi$ as its
stationary distribution. Thus LLN (7.13) and (7.15) hold in this case. Thus
for large $n$, $Z_n$ can be viewed as a sample from a distribution that is close to
$\pi$ and one can approximate $\sum_{i,j} h(i, j)\pi_{ij}$ by $\sum_{i=1}^n h(X_i, Y_i)/n$.


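As an illustration, consider sampling from
$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right)$.
Note that the conditional distribution of $X$ given $Y = y$ and that of $Y$ given
$X = x$ are

\[ X|Y = y \sim N(\rho y, 1 - \rho^2) \quad \text{and} \quad Y|X = x \sim N(\rho x, 1 - \rho^2). \tag{7.22} \]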


Using this property, Gibbs sampling proceeds as described below to generate
$(X_n, Y_n)$, $n = 0, 1, 2, \ldots$, by starting from an arbitrary value $x_0$ for $X_0$, and
repeating the following steps for $i = 0, 1, \ldots, n$.

1. Given $x_i$ for $X$, draw a random deviate from $N(\rho x_i, 1 - \rho^2)$ and denote
it by $Y_i$.

2. Given $y_i$ for $Y$, draw a random deviate from $N(\rho y_i, 1 - \rho^2)$ and denote it
by $X_{i+1}$.
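Steps 1 and 2 translate almost line by line into code. The Python sketch below is a minimal illustration; the function name, the value $\rho = 0.8$, the run length and the starting value are our own hypothetical choices.

```python
import random

def gibbs_bivariate_normal(rho, n, x0=0.0, seed=0):
    """Generate (X_i, Y_i), i = 0, ..., n-1, by alternating the conditionals in (7.22)."""
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5          # conditional standard deviation
    x, pairs = x0, []
    for _ in range(n):
        y = rng.gauss(rho * x, sd)         # step 1: Y_i given X_i = x_i
        pairs.append((x, y))
        x = rng.gauss(rho * y, sd)         # step 2: X_{i+1} given Y_i = y_i
    return pairs

pairs = gibbs_bivariate_normal(rho=0.8, n=50_000)
mean_x = sum(x for x, _ in pairs) / len(pairs)
mean_xy = sum(x * y for x, y in pairs) / len(pairs)   # should be near rho
```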


The theory of Gibbs sampling tells us that if $n$ is large, then $(x_n, y_n)$ is a
random draw from a distribution that is close to
$N_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right)$. To see
why Gibbs sampler works here, recall that a sufficient condition for the LLN
(7.13) and the limit result (7.15) is that an appropriate irreducibility condition
holds and a stationary distribution exists. From steps 1 and 2 above and using
(7.22), one has

\[ Y_i = \rho X_i + \sqrt{1 - \rho^2}\, \eta_i \]

and

\[ X_{i+1} = \rho Y_i + \sqrt{1 - \rho^2}\, \xi_i, \]

where $\eta_i$ and $\xi_i$ are independent standard normal random variables
independent of $X_i$. Thus the sequence $\{X_i\}$ satisfies the stochastic difference equation

\[ X_{i+1} = \rho^2 X_i + U_{i+1}, \]

where

\[ U_{i+1} = \rho\sqrt{1 - \rho^2}\, \eta_i + \sqrt{1 - \rho^2}\, \xi_i. \]

Because $\eta_i$, $\xi_i$ are independent $N(0, 1)$ random variables, $U_{i+1}$ is also a
normally distributed random variable with mean 0 and variance $\rho^2(1 - \rho^2) + (1 -
\rho^2) = 1 - \rho^4$. Also $\{U_i\}_{i \ge 1}$ being i.i.d. makes $\{X_i\}_{i \ge 0}$ a Markov chain. It
turns out that the irreducibility condition holds here. Turning to stationarity,
note that if $X_0$ is a $N(0, 1)$ random variable, then $X_1 = \rho^2 X_0 + U_1$ is also a
$N(0, 1)$ random variable, because the variance of $X_1$ is $\rho^4 + 1 - \rho^4 = 1$ and the
mean of $X_1$ is 0. This makes the standard $N(0, 1)$ distribution a stationary
distribution for $\{X_n\}$.

The multivariate extension of the above-mentioned bivariate case is very
straightforward. Suppose $\pi$ is a probability distribution of a $k$-dimensional
random vector $(X_1, X_2, \ldots, X_k)$. If $u = (u_1, u_2, \ldots, u_k)$ is any $k$-vector, let
$u_{-i} = (u_1, u_2, \ldots, u_{i-1}, u_{i+1}, \ldots, u_k)$ be the $(k-1)$-dimensional vector resulting
by dropping the $i$th component $u_i$. Let $\pi_i(\cdot|x_{-i})$ denote the univariate
conditional distribution of $X_i$ given that $X_{-i} \equiv (X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_k) =
x_{-i}$. Now starting with some initial value for $X_0 = (x_{01}, x_{02}, \ldots, x_{0k})$
generate $X_1 = (X_{11}, X_{12}, \ldots, X_{1k})$ sequentially by generating $X_{11}$ according to
the univariate distribution $\pi_1(\cdot|x_{0,-1})$ and then generating $X_{12}$ according to
$\pi_2(\cdot|(X_{11}, x_{03}, x_{04}, \ldots, x_{0k}))$ and so on. The most important feature to
recognize here is that all the univariate conditional distributions, $X_i|X_{-i} = x_{-i}$,
known as full conditionals, should easily allow sampling from them. This turns
out to be the case in most hierarchical Bayes problems. Thus, the Gibbs sam- 
pler is particularly well adapted for Bayesian computations with hierarchical 
priors. This was the motivation for some vigorous initial development of Gibbs 
sampling as can be seen in Gelfand and Smith (1990). 

The Gibbs sampler can be justified without showing that it is a special 
case of the Metropolis-Hastings algorithm. Even if it is considered a special 
case, it still has special features that need recognition. One such feature is 
that full conditionals have sufficient information to uniquely determine a
multivariate joint distribution. This is the famous Hammersley-Clifford theorem.
The following condition introduced by Besag (1974) is needed to state this 
result. 


Definition 7.7. Let $p(y_1, \ldots, y_k)$ be the joint density of a random vector $Y =
(Y_1, \ldots, Y_k)$ and let $p^{(i)}(y_i)$ denote the marginal density of $Y_i$, $i = 1, \ldots, k$.
If $p^{(i)}(y_i) > 0$ for every $i = 1, \ldots, k$ implies that $p(y_1, \ldots, y_k) > 0$, then the
joint density $p$ is said to satisfy the positivity condition.

Let us use the notation $p_i(y_i|y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_k)$ for the conditional
density of $Y_i|Y_{-i} = y_{-i}$.


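Theorem 7.8. (Hammersley-Clifford) Under the positivity condition, the
joint density $p$ satisfies

\[ p(y_1, \ldots, y_k) \propto \prod_{j=1}^{k} \frac{p_j(y_j|y_1, \ldots, y_{j-1}, y'_{j+1}, \ldots, y'_k)}{p_j(y'_j|y_1, \ldots, y_{j-1}, y'_{j+1}, \ldots, y'_k)} \]

for every $y$ and $y'$ in the support of $p$.

Proof. For $y$ and $y'$ in the support of $p$,

\[
\begin{aligned}
p(y_1, \ldots, y_k) &= p_k(y_k|y_1, \ldots, y_{k-1})\, p(y_1, \ldots, y_{k-1}) \\
&= \frac{p_k(y_k|y_1, \ldots, y_{k-1})}{p_k(y'_k|y_1, \ldots, y_{k-1})}\, p(y_1, \ldots, y_{k-1}, y'_k) \\
&= \frac{p_k(y_k|y_1, \ldots, y_{k-1})}{p_k(y'_k|y_1, \ldots, y_{k-1})}\,
   \frac{p_{k-1}(y_{k-1}|y_1, \ldots, y_{k-2}, y'_k)}{p_{k-1}(y'_{k-1}|y_1, \ldots, y_{k-2}, y'_k)}\,
   p(y_1, \ldots, y_{k-2}, y'_{k-1}, y'_k) \\
&= \cdots \\
&= p(y'_1, \ldots, y'_k) \prod_{j=1}^{k} \frac{p_j(y_j|y_1, \ldots, y_{j-1}, y'_{j+1}, \ldots, y'_k)}{p_j(y'_j|y_1, \ldots, y_{j-1}, y'_{j+1}, \ldots, y'_k)}. \qquad \Box
\end{aligned}
\]
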


It can be shown also that under the positivity condition, the Gibbs sampler 
generates an irreducible Markov chain, thus providing the necessary conver- 
gence properties without recourse to the M-H algorithm. Additional conditions 
are, however, required to extend the above theorem to the non-positive case, 
details of which may be found in Robert and Casella (1999). 
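For $k = 2$ the theorem reads $p(y_1, y_2) \propto \{p_1(y_1|y'_2)/p_1(y'_1|y'_2)\}\{p_2(y_2|y_1)/p_2(y'_2|y_1)\}$, and this can be checked numerically. The Python sketch below is a hypothetical illustration (the $2 \times 3$ table of strictly positive probabilities and all names are ours): it computes the two sets of full conditionals from a joint table, then rebuilds the joint from the conditionals alone.

```python
def reconstruct_joint(cond1, cond2, ref=(0, 0)):
    """Hammersley-Clifford for k = 2.

    cond1[i][j] = p1(y1 = i | y2 = j), cond2[i][j] = p2(y2 = j | y1 = i),
    ref = (y1', y2') is an arbitrary reference point in the support.
    """
    r1, r2 = ref
    K, L = len(cond1), len(cond1[0])
    w = [[(cond1[i][r2] / cond1[r1][r2]) * (cond2[i][j] / cond2[i][r2])
          for j in range(L)] for i in range(K)]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]       # normalize the product

# Hypothetical joint table satisfying the positivity condition.
joint = [[0.10, 0.20, 0.05],
         [0.15, 0.30, 0.20]]
row_sum = [sum(row) for row in joint]
col_sum = [sum(joint[i][j] for i in range(2)) for j in range(3)]
cond1 = [[joint[i][j] / col_sum[j] for j in range(3)] for i in range(2)]
cond2 = [[joint[i][j] / row_sum[i] for j in range(3)] for i in range(2)]

rebuilt = reconstruct_joint(cond1, cond2, ref=(1, 2))
max_err = max(abs(rebuilt[i][j] - joint[i][j]) for i in range(2) for j in range(3))
```

Any reference point in the support gives the same normalized answer, as the theorem asserts.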


7.4.5 Rao-Blackwellization 


The variance reduction idea of the famous Rao-Blackwell theorem in the
presence of auxiliary information can be used to provide improved estimators when
MCMC procedures are adopted. Let us first recall this theorem.


Theorem 7.9. (Rao-Blackwell theorem) Let $\delta(X_1, X_2, \ldots, X_n)$ be an
estimator of $\theta$ with finite variance. Suppose that $T$ is sufficient for $\theta$, and let
$\delta^*(T)$, defined by $\delta^*(t) = E(\delta(X_1, X_2, \ldots, X_n)|T = t)$, be the conditional
expectation of $\delta(X_1, X_2, \ldots, X_n)$ given $T = t$. Then

\[ E(\delta^*(T) - \theta)^2 \le E(\delta(X_1, X_2, \ldots, X_n) - \theta)^2. \]

The inequality is strict unless $\delta = \delta^*$, or equivalently, $\delta$ is already a function
of $T$.


Proof. By the property of iterated conditional expectation,

\[ E(\delta^*(T)) = E\left[ E(\delta(X_1, X_2, \ldots, X_n)|T) \right] = E(\delta(X_1, X_2, \ldots, X_n)). \]

Therefore, to compare the mean squared errors (MSE) of the two estimators,
we need to compare their variances only. Now,

\[
\begin{aligned}
\mathrm{Var}(\delta(X_1, X_2, \ldots, X_n)) &= \mathrm{Var}\left[ E(\delta|T) \right] + E\left[ \mathrm{Var}(\delta|T) \right] \\
&= \mathrm{Var}(\delta^*) + E\left[ \mathrm{Var}(\delta|T) \right] \\
&> \mathrm{Var}(\delta^*),
\end{aligned}
\]

unless $\mathrm{Var}(\delta|T) = 0$, which is the case only if $\delta$ itself is a function of $T$. $\Box$


The Rao-Blackwell theorem involves two key steps: variance reduction by
conditioning and conditioning by a sufficient statistic. The first step is based 
on the analysis of variance formula: For any two random variables S and T, 
because 

Var(S) = Var(E(S|T)) + E(Var(S|T)), 


one can reduce the variance of a random variable S by taking conditional 
expectation given some auxiliary information T. This can be exploited in 
MCMC. 

Let $(X_j, Y_j)$, $j = 1, 2, \ldots, N$ be the data generated by a single run of
the Gibbs sampler algorithm with a target distribution of a bivariate random
vector $(X, Y)$. Let $h(X)$ be a function of the $X$ component of $(X, Y)$
and let its mean value be $\mu$. Suppose the goal is to estimate $\mu$. A first
estimate is the sample mean of the $h(X_j)$, $j = 1, 2, \ldots, N$. From the MCMC
theory, it can be shown that as $N \to \infty$, this estimate will converge to
$\mu$ in probability. The computation of variance of this estimator is not easy
due to the (Markovian) dependence of the sequence $\{X_j, j = 1, 2, \ldots, N\}$.
Now suppose we make $n$ independent runs of Gibbs sampler and generate
$(X_{ij}, Y_{ij})$, $j = 1, 2, \ldots, N$; $i = 1, 2, \ldots, n$. Now suppose that $N$ is sufficiently
large so that $(X_{iN}, Y_{iN})$ can be regarded as a sample from the limiting target
distribution of the Gibbs sampling scheme. Thus $(X_{iN}, Y_{iN})$, $i = 1, 2, \ldots, n$ are
i.i.d. and hence form a random sample from the target distribution. Then one
can offer a second estimate of $\mu$: the sample mean of $h(X_{iN})$, $i = 1, 2, \ldots, n$.
This estimator ignores a good part of the MCMC data but has the advantage
that the variables $h(X_{iN})$, $i = 1, 2, \ldots, n$ are independent and hence the
variance of their mean is of order $n^{-1}$. Now applying the variance reduction
idea of the Rao-Blackwell theorem by using the auxiliary information $Y_{iN}$,
$i = 1, 2, \ldots, n$, one can improve this estimator as follows:

Let $k(y) = E(h(X)|Y = y)$. Then for each $i$, $k(Y_{iN})$ has a smaller variance
than $h(X_{iN})$ and hence the following third estimator,

\[ \frac{1}{n} \sum_{i=1}^{n} k(Y_{iN}), \]

has a smaller variance than the second one. A crucial fact to keep in mind
here is that the exact functional form of $k(y)$ must be available for implementing
this improvement.
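The gain from Rao-Blackwellization can be seen in a small experiment with the bivariate normal Gibbs sampler of Section 7.4.4. The Python sketch below is only an illustration under our own choices: we take $h(X) = X^2$, so that $\mu = 1$ and, from (7.22), $k(y) = \rho^2 y^2 + 1 - \rho^2$; the values of $\rho$, $n$ and $N$ and the function name are hypothetical.

```python
import random
import statistics

def gibbs_final_pair(rho, N, rng):
    """One independent Gibbs run of length N for the bivariate normal; returns (X_N, Y_N)."""
    sd = (1.0 - rho * rho) ** 0.5
    x = y = 0.0
    for _ in range(N):
        x = rng.gauss(rho * y, sd)         # X given Y = y
        y = rng.gauss(rho * x, sd)         # Y given X = x
    return x, y

rho, n, N = 0.6, 2000, 30
rng = random.Random(0)
pairs = [gibbs_final_pair(rho, N, rng) for _ in range(n)]

h_vals = [x * x for x, _ in pairs]                               # h(X_iN) = X_iN^2
k_vals = [rho ** 2 * y * y + 1.0 - rho ** 2 for _, y in pairs]   # k(Y_iN) = E(X^2 | Y = Y_iN)

second_estimate = statistics.mean(h_vals)   # sample mean of h(X_iN)
third_estimate = statistics.mean(k_vals)    # Rao-Blackwellized estimate
```

Under stationarity $\mathrm{Var}(h(X)) = 2$ while $\mathrm{Var}(k(Y)) = 2\rho^4$, so for $\rho = 0.6$ the third estimator should show a visibly smaller sample variance than the second.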
7.4.6 Examples


Example 7.10. (Example 7.1 continued.) Recall that $X|\theta \sim N(\theta, \sigma^2)$ with
known $\sigma^2$ and $\theta \sim \text{Cauchy}(\mu, \tau)$. The task is to simulate $\theta$ from the
posterior distribution, but we have already noted that sampling directly from the
posterior distribution is difficult. What facilitates Gibbs sampling here is the
result that the Student's t density, of which Cauchy is a special case, is a
scale mixture of normal densities, with the scale parameter having a Gamma
distribution (see Section 2.7.2, Jeffreys test). Specifically,

\[
\begin{aligned}
\pi(\theta) &\propto \left( \tau^2 + (\theta - \mu)^2 \right)^{-1} \\
&\propto \int_0^{\infty} \left( \frac{\lambda}{2\pi\tau^2} \right)^{1/2} \exp\left( -\frac{\lambda}{2\tau^2}(\theta - \mu)^2 \right) \lambda^{1/2 - 1} \exp\left( -\frac{\lambda}{2} \right) d\lambda,
\end{aligned}
\]




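so that $\pi(\theta)$ may be considered the marginal prior density from the joint prior
density of $(\theta, \lambda)$ where

\[ \theta|\lambda \sim N(\mu, \tau^2/\lambda) \quad \text{and} \quad \lambda \sim \text{Gamma}(1/2, 1/2). \]

It can be noted that this leads to an implicit hierarchical prior structure
with $\lambda$ being the hyperparameter. Consequently, $\pi(\theta|x)$ may be treated as
the marginal density from $\pi(\theta, \lambda|x)$. Now note that the full conditionals of
$\pi(\theta, \lambda|x)$ are standard distributions from which sampling is easy. In particular,

\[ \theta|\lambda, x \sim N\left( \frac{\tau^2}{\tau^2 + \lambda\sigma^2}\, x + \frac{\lambda\sigma^2}{\tau^2 + \lambda\sigma^2}\, \mu,\ \frac{\tau^2\sigma^2}{\tau^2 + \lambda\sigma^2} \right), \tag{7.23} \]

\[ \lambda|\theta, x \sim \lambda|\theta \sim \text{Exponential}\left( \frac{\tau^2 + (\theta - \mu)^2}{2\tau^2} \right). \tag{7.24} \]

Thus, the Gibbs sampler will use (7.23) and (7.24) to generate $(\theta, \lambda)$ from
$\pi(\theta, \lambda|x)$.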
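A minimal Python sketch of this Gibbs sampler follows. We read the argument of the Exponential distribution in (7.24) as a rate, which is what the mixture representation gives, since the conditional density of $\lambda$ is proportional to $\exp\{-\lambda(\tau^2 + (\theta - \mu)^2)/(2\tau^2)\}$. The function name, the observed value and the prior constants in the usage lines are hypothetical.

```python
import math
import random

def gibbs_normal_cauchy(x, sigma2, mu, tau, n_iter=20_000, burn=1_000, seed=0):
    """Alternate between (7.24) and (7.23); returns the retained draws of theta."""
    rng = random.Random(seed)
    tau2 = tau * tau
    theta, draws = x, []
    for t in range(n_iter):
        # (7.24): lambda | theta is exponential with rate (tau^2 + (theta - mu)^2) / (2 tau^2)
        lam = rng.expovariate((tau2 + (theta - mu) ** 2) / (2.0 * tau2))
        # (7.23): theta | lambda, x is normal
        d = tau2 + lam * sigma2
        mean = (tau2 * x + lam * sigma2 * mu) / d
        theta = rng.gauss(mean, math.sqrt(tau2 * sigma2 / d))
        if t >= burn:
            draws.append(theta)
    return draws

# Hypothetical inputs: x = 2 observed, sigma^2 = 1, Cauchy(0, 1) prior.
draws = gibbs_normal_cauchy(x=2.0, sigma2=1.0, mu=0.0, tau=1.0)
posterior_mean = sum(draws) / len(draws)
```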


Example 7.11. Consider the following example due to Casella and George 
given in Arnold (1993). Suppose we are studying the distribution of the num- 
ber of defectives X in the daily production of a product. Consider the model 
$(X|Y, \theta) \sim \text{binomial}(Y, \theta)$, where $Y$, a day's production, is a random variable
with a Poisson distribution with known mean $\lambda$, and $\theta$ is the probability that
any product is defective. The difficulty, however, is that $Y$ is not observable,
and inference has to be made on the basis of $X$ only. The prior distribution
is such that $(\theta|Y = y) \sim \text{Beta}(\alpha, \gamma)$, with known $\alpha$ and $\gamma$ independent of
$Y$. Bayesian analysis here is not a particularly difficult problem because the
posterior distribution of $\theta|X = x$ can be obtained as follows. First, note that
$X|\theta \sim \text{Poisson}(\lambda\theta)$. Next, $\theta \sim \text{Beta}(\alpha, \gamma)$. Therefore,

\[ \pi(\theta|X = x) \propto \exp(-\lambda\theta)\, \theta^{x+\alpha-1}(1 - \theta)^{\gamma-1}, \quad 0 < \theta < 1. \tag{7.25} \]

The only difficulty is that this is not a standard distribution, and hence
posterior quantities cannot be obtained in closed form. Numerical integration
is quite simple to perform with this density. However, Gibbs sampling
provides an excellent alternative. Instead of focusing on $\theta|X$ directly, view it as
a marginal component of $(Y, \theta|X)$. It can be immediately checked that the
full conditionals of this are given by

$Y|X = x, \theta \sim x + \text{Poisson}(\lambda(1 - \theta))$, and

$\theta|X = x, Y = y \sim \text{Beta}(\alpha + x, \gamma + y - x)$,

both of which are standard distributions.
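These two full conditionals give a very short Gibbs sampler. In the Python sketch below the Poisson deviate is generated by Knuth's multiplication method, which is our own choice (adequate for moderate means) and not something prescribed by the text; the function names, the run length, the starting value and the numerical inputs in the usage lines are likewise assumptions made for illustration.

```python
import math
import random

def poisson(mean, rng):
    """Poisson deviate by Knuth's multiplication method (fine for moderate means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def gibbs_defectives(x, alpha, gamma, lam, n_iter=5_000, burn=500, seed=0):
    """Gibbs sampler for (Y, theta | X = x); returns the retained draws of theta."""
    rng = random.Random(seed)
    theta, draws = 0.5, []
    for t in range(n_iter):
        y = x + poisson(lam * (1.0 - theta), rng)           # Y | X = x, theta
        theta = rng.betavariate(alpha + x, gamma + y - x)    # theta | X = x, Y = y
        if t >= burn:
            draws.append(theta)
    return draws

# Hypothetical inputs for illustration.
draws = gibbs_defectives(x=1, alpha=1.0, gamma=49.0, lam=100.0)
posterior_mean = sum(draws) / len(draws)
```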


Example 7.12. (Example 7.11 continued.) It is actually possible here to sample 
from the posterior distribution using what is known as the accept-reject Monte 
Carlo method. This widely applicable method operates as follows. Let g(x) /K 
be the target density, where K is the possibly unknown normalizing constant 
of the unnormalized density g. Suppose h(x) is a density that can be simulated 
by a known method and is close to g, and suppose there exists a known 
constant $c > 0$ such that $g(x) \le c\,h(x)$ for all $x$. Then, to simulate from the
target density, the following two steps suffice. (See Robert and Casella (1999)
for details.)

Step 1. Generate $Y \sim h$ and $U \sim U(0, 1)$;

Step 2. Accept $X = Y$ if $U \le g(Y)/\{c\,h(Y)\}$; return to Step 1 otherwise.

The optimal choice for $c$ is $\sup_x\{g(x)/h(x)\}$, but even this choice may result in
an undesirably large number of rejections.
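Steps 1 and 2 can be sketched in Python for the posterior (7.25), taking $h$ to be the Beta$(x + \alpha, \gamma)$ density, the choice made in the paragraph that follows. With $c = \sup g/h$ the ratio $g(Y)/\{c\,h(Y)\}$ reduces to $\exp(-\lambda Y)$, so no normalizing constant is needed. The function name and the numerical inputs are hypothetical.

```python
import math
import random

def accept_reject(x, alpha, gamma, lam, n, seed=0):
    """Draw n values from (7.25) by accept-reject; also returns the number of proposals used."""
    rng = random.Random(seed)
    accepted, proposals = [], 0
    while len(accepted) < n:
        y = rng.betavariate(x + alpha, gamma)    # Step 1: Y ~ h
        u = rng.random()                         #         U ~ U(0, 1)
        proposals += 1
        if u <= math.exp(-lam * y):              # Step 2: g(Y) / {c h(Y)} = exp(-lam * Y)
            accepted.append(y)
    return accepted, proposals

# Hypothetical inputs for illustration.
sample, proposals = accept_reject(x=1, alpha=1.0, gamma=49.0, lam=100.0, n=5_000)
acceptance_rate = len(sample) / proposals
posterior_mean = sum(sample) / len(sample)
```

The observed acceptance rate gives a direct feel for the "undesirably large number of rejections" mentioned above.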

In our example, from (7.25),

\[ g(\theta) = \exp(-\lambda\theta)\, \theta^{x+\alpha-1}(1 - \theta)^{\gamma-1} I\{0 \le \theta \le 1\}, \]

so that $h(\theta)$ may be chosen to be the density of Beta$(x + \alpha, \gamma)$. Then, with
the above-mentioned choice for $c$, if $\theta \sim \text{Beta}(x + \alpha, \gamma)$ is generated in Step 1,
its 'acceptance probability' in Step 2 is simply $\exp(-\lambda\theta)$. Even though this
method can be employed here, we would like to use this technique
to illustrate the Metropolis-Hastings algorithm. The required Markov chain is
generated by taking the transition density $q(z, y) = q(y|z) = h(y)$,
independently of $z$. Then the acceptance probability is

\[ \rho(z, y) = \min\left\{ \frac{g(y)h(z)}{g(z)h(y)}, 1 \right\} = \min\left\{ \exp(-\lambda(y - z)), 1 \right\}. \]


Thus the steps involved in this "independent" M-H algorithm are as follows.
Start at $t = 0$ with a value $x_0$ in the support of the target distribution; in
this case, $0 < x_0 < 1$. Given $x_t$, generate the next value in the chain as given
below.

(a) Draw $Y_t$ from Beta$(x + \alpha, \gamma)$, where $x$ is the observed value of $X$.

(b) Let

\[
x_{t+1} =
\begin{cases}
Y_t & \text{with probability } \rho_t, \\
x_t & \text{otherwise},
\end{cases}
\]

where $\rho_t = \min\{\exp(-\lambda(Y_t - x_t)), 1\}$.

(c) Set $t = t + 1$ and go to step (a).

Run this chain until $t = n$, a suitably chosen large integer. Details on its
convergence as well as why independent M-H is more efficient than
accept-reject Monte Carlo can be found in Robert and Casella (1999). In our example,
for $x = 1$, $\alpha = 1$, $\gamma = 49$ and $\lambda = 100$, we simulated such a Markov chain. The
resulting frequency histogram is shown in Figure 7.1, with the true posterior
density super-imposed on it.
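Steps (a) to (c) amount to only a few lines of code. The Python sketch below uses the values $x = 1$, $\alpha = 1$, $\gamma = 49$ and $\lambda = 100$ quoted above; the function name, the chain length, the seed and the choice of starting value are our own assumptions, and the sketch makes no attempt to reproduce Figure 7.1.

```python
import math
import random

def independent_mh(x, alpha, gamma, lam, n, seed=0):
    """Independent M-H chain for (7.25) with Beta(x + alpha, gamma) proposals."""
    rng = random.Random(seed)
    cur = rng.betavariate(x + alpha, gamma)          # x_0 in (0, 1), drawn from the proposal
    chain = []
    for _ in range(n):
        y = rng.betavariate(x + alpha, gamma)        # (a) draw Y_t
        rho = min(math.exp(-lam * (y - cur)), 1.0)   # (b) acceptance probability rho_t
        if rng.random() < rho:
            cur = y                                  #     move to Y_t
        chain.append(cur)                            #     otherwise repeat x_t
    return chain

chain = independent_mh(x=1, alpha=1.0, gamma=49.0, lam=100.0, n=50_000)
posterior_mean = sum(chain) / len(chain)
```

A frequency histogram of `chain` can then be compared with the density (7.25), normalized numerically.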


Fig. 7.1. M-H frequency histogram and true posterior density.


Example 7.13. In this example, we discuss the hierarchical Bayesian analysis
of the usual one-way ANOVA. Consider the model

\[
\begin{aligned}
y_{ij} &= \theta_i + \epsilon_{ij}, \quad j = 1, \ldots, n_i;\ i = 1, \ldots, k; \\
\epsilon_{ij} &\sim N(0, \sigma_i^2), \quad j = 1, \ldots, n_i;\ i = 1, \ldots, k,
\end{aligned}
\tag{7.26}
\]

and are independent. Let the first stage prior on $\theta_i$ and $\sigma_i^2$ be such that they
are i.i.d. with

\[ \theta_i \sim N(\mu_\pi, \sigma_\pi^2), \quad i = 1, \ldots, k; \]

\[ \sigma_i^2 \sim \text{inverse Gamma}(a_1, b_1), \quad i = 1, \ldots, k. \]

The second stage prior on $\mu_\pi$ and $\sigma_\pi^2$ is

\[ \mu_\pi \sim N(\mu_0, \sigma_0^2) \quad \text{and} \quad \sigma_\pi^2 \sim \text{inverse Gamma}(a_2, b_2). \]


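Here $a_1$, $a_2$, $b_1$, $b_2$, $\mu_0$, and $\sigma_0^2$ are all specified constants. Let us concentrate
on computing

\[ \mu(y) = E^{\pi}(\theta|y). \]

Sufficiency reduces this to considering only

\[ \bar{Y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}, \quad i = 1, \ldots, k, \quad \text{and} \quad S_i^2 = \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2, \quad i = 1, \ldots, k. \]

From normal theory,

\[ \bar{Y}_i|\theta, \sigma^2 \sim N(\theta_i, \sigma_i^2/n_i), \quad i = 1, \ldots, k, \]

which are independent and are also independent of

\[ S_i^2|\theta, \sigma^2 \sim \sigma_i^2 \chi^2_{n_i - 1}, \quad i = 1, \ldots, k, \]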

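which again are independent. To utilize the Gibbs sampler, we need the full
conditionals of $\pi(\theta, \sigma^2, \mu_\pi, \sigma_\pi^2|y)$. It can be noted that it is sufficient, and in
fact advantageous, to consider the conditionals,

(i) $\pi(\theta|\sigma^2, \mu_\pi, \sigma_\pi^2, y)$,

(ii) $\pi(\sigma^2|\theta, \mu_\pi, \sigma_\pi^2, y)$,

(iii) $\pi(\mu_\pi|\sigma_\pi^2, \theta, \sigma^2, y)$, and

(iv) $\pi(\sigma_\pi^2|\mu_\pi, \theta, \sigma^2, y)$,

rather than considering the set of all univariate full conditionals because of
the special structure in this problem. First note that

\[ \theta|\mu_\pi, \sigma_\pi^2 \sim N_k(\mu_\pi \mathbf{1}, \sigma_\pi^2 I_k), \]

and hence

\[ \theta|\mu_\pi, \sigma_\pi^2, \sigma^2, y \sim N_k(\mu^*, \Sigma^*), \]

where

\[ \mu_i^* = \frac{\sigma_\pi^2}{\sigma_\pi^2 + \sigma_i^2/n_i}\, \bar{y}_i + \frac{\sigma_i^2/n_i}{\sigma_\pi^2 + \sigma_i^2/n_i}\, \mu_\pi \quad \text{and} \]

\[ \Sigma^* \text{ is diagonal with } \Sigma_{ii}^* = \frac{\sigma_\pi^2\, \sigma_i^2/n_i}{\sigma_\pi^2 + \sigma_i^2/n_i}, \tag{7.27} \]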
which determines (i). Next we note that, given $\theta$, from (7.26), $S_i^{*2} =
\sum_{j=1}^{n_i} (y_{ij} - \theta_i)^2$ is sufficient for $\sigma_i^2$, and they are independently distributed.
Thus we have,

\[ S_i^{*2}|\sigma^2, \theta \sim \sigma_i^2 \chi^2_{n_i}, \quad i = 1, \ldots, k, \]

and are independent, and $\sigma_i^2$ are i.i.d. inverse Gamma$(a_1, b_1)$, so that

\[ \sigma_i^2|\theta, \mu_\pi, \sigma_\pi^2, y \sim \text{inverse Gamma}\left( a_1 + \frac{n_i}{2},\ b_1 + \frac{1}{2} S_i^{*2} \right), \tag{7.28} \]


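and they are independent for $i = 1, \ldots, k$, which specifies (ii). Turning to the
full conditional of $\mu_\pi$, we note from the hierarchical structure that the
conditional distribution of $\mu_\pi|\sigma_\pi^2, \theta, \sigma^2, y$ is the same as the conditional distribution
of $\mu_\pi|\sigma_\pi^2, \theta$. To determine this distribution, note that

\[ \theta_i|\mu_\pi, \sigma_\pi^2 \sim N(\mu_\pi, \sigma_\pi^2), \]

for $i = 1, \ldots, k$ and are i.i.d., and $\mu_\pi \sim N(\mu_0, \sigma_0^2)$. Therefore, treating $\theta$ to be
a random sample from $N(\mu_\pi, \sigma_\pi^2)$, so that $\bar{\theta} = \sum_{i=1}^{k} \theta_i/k$ is sufficient for $\mu_\pi$,
we have the joint distribution,

\[ \bar{\theta}|\mu_\pi, \sigma_\pi^2 \sim N(\mu_\pi, \sigma_\pi^2/k), \quad \text{and} \quad \mu_\pi \sim N(\mu_0, \sigma_0^2). \]

Thus we obtain,

\[ \mu_\pi|\sigma_\pi^2, \theta, \sigma^2, y \sim N\left( \frac{\sigma_0^2}{\sigma_0^2 + \sigma_\pi^2/k}\, \bar{\theta} + \frac{\sigma_\pi^2/k}{\sigma_0^2 + \sigma_\pi^2/k}\, \mu_0,\ \frac{\sigma_0^2\, \sigma_\pi^2/k}{\sigma_0^2 + \sigma_\pi^2/k} \right), \tag{7.29} \]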
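which provides (iii). Just as in the previous case, the conditional distribution
of $\sigma_\pi^2 \mid \mu_\pi, \theta, \sigma^2, y$ turns out to be the same as the conditional distribution of
$\sigma_\pi^2 \mid \mu_\pi, \theta$. To obtain this, note again that

$$\theta_i \mid \mu_\pi, \sigma_\pi^2 \sim N(\mu_\pi, \sigma_\pi^2)$$

for $i = 1, \ldots, k$ and are i.i.d., so that this time $\sum_{i=1}^{k} (\theta_i - \mu_\pi)^2$ is sufficient for
$\sigma_\pi^2$. Further,

$$\sum_{i=1}^{k} (\theta_i - \mu_\pi)^2 \mid \sigma_\pi^2 \sim \sigma_\pi^2 \chi^2_k \quad\text{and}\quad \sigma_\pi^2 \sim \text{inverse Gamma}(a_2, b_2),$$

so that

$$\sigma_\pi^2 \mid \mu_\pi, \theta, \sigma^2, y \sim \text{inverse Gamma}\left(a_2 + \frac{k}{2},\; b_2 + \frac{1}{2}\sum_{i=1}^{k} (\theta_i - \mu_\pi)^2\right). \tag{7.30}$$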


This gives us (iv), thus completing the specification of all the required full 
conditionals. It may be noted that the Gibbs sampler in this problem requires 
simulations from only the standard normal and the inverse Gamma distribu- 
tions. 
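
The four steps (i) to (iv) can be sketched in code as follows. This is only a minimal illustration, not a definitive implementation: the function names, the data `y_demo`, the hyperparameter values, and the starting values are all hypothetical, and the inverse Gamma$(a, b)$ distribution is assumed to have density proportional to $x^{-(a+1)} e^{-b/x}$.

```python
import math
import random


def inv_gamma(rng, shape, scale):
    # If G ~ Gamma(shape, 1), then scale / G has density proportional to
    # x^-(shape+1) * exp(-scale / x): the parametrization assumed here.
    return scale / rng.gammavariate(shape, 1.0)


def gibbs_one_way(y, a1, b1, a2, b2, mu0, s0sq, n_iter=3000, burn=500, seed=1):
    """Cycle through conditionals (i)-(iv); return the post-burn-in mean of theta."""
    rng = random.Random(seed)
    k = len(y)
    n = [len(g) for g in y]
    ybar = [sum(g) / len(g) for g in y]
    theta, sig2 = ybar[:], [1.0] * k          # arbitrary starting values
    mu_pi, sig2_pi = sum(ybar) / k, 1.0
    total = [0.0] * k
    for it in range(n_iter):
        # (i) theta_i | rest: weighted compromise between ybar_i and mu_pi
        for i in range(k):
            v = sig2[i] / n[i]
            w = sig2_pi / (sig2_pi + v)
            theta[i] = rng.gauss(w * ybar[i] + (1 - w) * mu_pi,
                                 math.sqrt(sig2_pi * v / (sig2_pi + v)))
        # (ii) sigma_i^2 | rest: inverse Gamma update using sum_j (y_ij - theta_i)^2
        for i in range(k):
            ss = sum((obs - theta[i]) ** 2 for obs in y[i])
            sig2[i] = inv_gamma(rng, a1 + n[i] / 2, b1 + ss / 2)
        # (iii) mu_pi | rest: depends on theta only through its average
        tbar, v = sum(theta) / k, sig2_pi / k
        w = s0sq / (s0sq + v)
        mu_pi = rng.gauss(w * tbar + (1 - w) * mu0,
                          math.sqrt(s0sq * v / (s0sq + v)))
        # (iv) sigma_pi^2 | rest: inverse Gamma update using sum_i (theta_i - mu_pi)^2
        ss = sum((t - mu_pi) ** 2 for t in theta)
        sig2_pi = inv_gamma(rng, a2 + k / 2, b2 + ss / 2)
        if it >= burn:
            for i in range(k):
                total[i] += theta[i]
    return [t / (n_iter - burn) for t in total]


# Hypothetical data: three groups of five observations each.
y_demo = [[1.2, 0.8, 1.1, 0.9, 1.0],
          [4.9, 5.2, 5.1, 4.8, 5.0],
          [9.1, 8.9, 9.0, 9.2, 8.8]]
post_mean = gibbs_one_way(y_demo, a1=2, b1=2, a2=2, b2=2, mu0=5.0, s0sq=100.0)
```

The ergodic average `post_mean` is the Monte Carlo estimate of $E^{\pi}(\theta \mid y)$; the burn-in length used here is an arbitrary choice.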


Reversible Jump MCMC 

There are situations, especially in model selection problems, where the MCMC
procedure should be capable of moving between parameter spaces of different
dimensions. The standard M-H algorithm described earlier is incapable
of such movements, whereas the reversible jump algorithm of Green (1995) is
an extension of the standard M-H algorithm to allow exactly this possibility.
The basic idea behind this technique as applied to model selection is as follows.
Given two models $M_1$ and $M_2$ with parameter sets $\theta_1$ and $\theta_2$, which are
possibly of different dimensions, fill the difference in the dimensions by
supplementing the parameter sets of these models. In other words, find auxiliary
variables $\gamma_{12}$ and $\gamma_{21}$ such that $(\theta_1, \gamma_{12})$ and $(\theta_2, \gamma_{21})$ can be mapped with
a bijection. Now use the standard M-H algorithm to move between the two
models; for moves of the M-H chain within a model, the auxiliary variables
are not needed. We sketch this procedure below, but for further details refer
to Robert and Casella (1999), Green (1995), Sorensen and Gianola (2002),
Waagepetersen and Sorensen (2001), and Brooks et al. (2003).

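Consider models $M_1, M_2, \ldots$ where model $M_i$ has a continuous parameter
space $\Theta_i$. The parameter space for the model selection problem as a whole
may be taken to be

$$\{(M_i, \theta_i) : \theta_i \in \Theta_i,\ i = 1, 2, \ldots\}.$$

Let $f(x \mid M_i, \theta_i)$ be the model density under model $M_i$, and the prior density
be

$$\pi(\theta) = \sum_i \pi_i\, \pi(\theta_i \mid M_i)\, I(\theta = \theta_i \in \Theta_i),$$

where $\pi_i$ is the prior probability of model $M_i$ and $\pi(\theta_i \mid M_i)$ is the prior density
conditional on $M_i$ being true. Then the posterior probability of any $B \subset \cup_i \Theta_i$ is

$$\pi(B \mid x) = \sum_i \int_{B \cap \Theta_i} \pi(\theta_i \mid M_i, x)\, d\theta_i,$$

where

$$\pi(\theta_i \mid M_i, x) \propto \pi_i\, \pi(\theta_i \mid M_i)\, f(x \mid M_i, \theta_i)$$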


is the posterior density restricted to $M_i$. To compute the Bayes factor of $M_k$
relative to $M_l$, we will need

$$\frac{P^{\pi}(M_k \mid x)}{P^{\pi}(M_l \mid x)} \cdot \frac{\pi_l}{\pi_k},$$

where

$$P^{\pi}(M_i \mid x) = \frac{\pi_i \int_{\Theta_i} \pi(\theta_i \mid M_i) f(x \mid M_i, \theta_i)\, d\theta_i}{\sum_j \pi_j \int_{\Theta_j} \pi(\theta_j \mid M_j) f(x \mid M_j, \theta_j)\, d\theta_j}$$

is the posterior probability of $M_i$. Therefore, for the target density $\pi(\theta \mid x)$,
we need a version of the M-H algorithm that will facilitate the above-shown
computations. Suppose $\theta_i$ is a vector of length $n_i$. It suffices to focus on
moves between $\theta_i$ in model $M_i$ and $\theta_j$ in model $M_j$ with $n_i < n_j$. The
scheme provided by Green (1995) is as follows. If the current state of the chain
is $(M_i, \theta_i)$, a new value $(M_j, \theta_j)$ is proposed for the chain from a proposal
(transition) distribution $Q(\theta_i, d\theta_j)$, which is then accepted with a certain
acceptance probability. To move from model $M_i$ to $M_j$, generate a random
vector $V$ of length $n_j - n_i$ from a proposal density

$$\psi_{ij}(v) = \prod_{m=1}^{n_j - n_i} \psi(v_m).$$

Identify an appropriate bijection map

$$h_{ij} : \Theta_i \times R^{n_j - n_i} \to \Theta_j,$$

and propose the move from $\theta_i$ to $\theta_j$ using $\theta_j = h_{ij}(\theta_i, V)$. The acceptance
probability is then

$$\rho((M_i, \theta_i), (M_j, \theta_j)) = \min\{1, \alpha_{ij}(\theta_i, \theta_j)\},$$

where

$$\alpha_{ij}(\theta_i, \theta_j) = \frac{\pi(\theta_j \mid M_j, x)\, p_{ji}(\theta_j)}{\pi(\theta_i \mid M_i, x)\, p_{ij}(\theta_i)\, \psi_{ij}(v)} \left| \frac{\partial h_{ij}(\theta_i, v)}{\partial(\theta_i, v)} \right|,$$

with $p_{ij}(\theta_i)$ denoting the (user-specified) probability that a proposed jump
to model $M_j$ is attempted at any step starting from $\theta_i \in \Theta_i$. Note that
$\sum_j p_{ij} = 1$.


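Example 7.14. For illustration purposes, consider the simple problem of
comparing two normal means as in Sorensen and Gianola (2002). Then, the two
models to be compared are

$$y_i \mid M_1, \nu, \sigma^2 \sim N(\nu, \sigma^2), \quad i = 1, 2, \ldots, m_1 + m_2, \text{ i.i.d.},$$

$$y_i \mid M_2, \nu_1, \nu_2, \sigma^2 \sim \begin{cases} N(\nu_1, \sigma^2), & i = 1, \ldots, m_1; \\ N(\nu_2, \sigma^2), & i = m_1 + 1, \ldots, m_1 + m_2. \end{cases}$$

To implement the reversible jump M-H algorithm we need the map $h_{12}$ taking
$(\nu, \sigma^2, V)$ to $(\nu_1, \nu_2, \sigma^2)$. A reasonable choice for this is the linear map

$$\begin{pmatrix} \nu_1 \\ \nu_2 \\ \sigma^2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} \nu \\ \sigma^2 \\ V \end{pmatrix}.$$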
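
The Jacobian factor that enters the acceptance probability for this linear map can be checked directly. The short sketch below (function names are hypothetical, introduced only for illustration) writes out the map and its inverse and evaluates the determinant of the coefficient matrix by cofactor expansion.

```python
def h12(nu, sigma2, v):
    """Map (nu, sigma^2, V) under M1 to (nu_1, nu_2, sigma^2) under M2."""
    return (nu + v, nu - v, sigma2)


def h12_inverse(nu1, nu2, sigma2):
    """Inverse map, used for the reverse move from M2 to M1."""
    return ((nu1 + nu2) / 2, sigma2, (nu1 - nu2) / 2)


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


A = [[1, 0, 1],
     [1, 0, -1],
     [0, 1, 0]]
jacobian = abs(det3(A))   # absolute determinant of the linear map
```

Because the map is linear, the Jacobian is constant: `jacobian` evaluates to 2 for every $(\nu, \sigma^2, V)$, so it contributes a fixed factor to $\alpha_{12}$.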


7.4.7 Convergence Issues 


As we have already seen, Monte Carlo sampling based approaches for inference 
make use of limit theorems such as the law of large numbers and the central 
limit theorem to justify their validity. When we add a further dimension to 
this sampling and adopt MCMC schemes, stronger limit theorems are needed. 
Ergodic theorems for Markov chains, such as those given in equations (7.13)
and (7.15), are the results that serve this purpose. It may appear at first that this procedure
necessarily depends on waiting until the Markov chain converges to the target
invariant distribution, and sampling from this distribution. In other words, one
needs to start a large number of chains beginning with different starting points, 
and pick the draws after letting these chains run sufficiently long. This is 
certainly an option, but the law of large numbers for dependent chains, (7.13) 
says also that this is unnecessary, and one could just use a single long chain. 
It may, however, be a good idea to use many different chains to ensure that 
convergence indeed takes place. For details, see Robert and Casella (1999). 

There is one important situation, however, where MCMC sampling can 
lead to absurd inferences. This is where one resorts to MCMC sampling with- 
out realizing that the target posterior distribution is not a probability distri- 
bution, but an improper one. The following example is similar to the normal 
problem (see Exercise 13) with lack of identifiability of parameters shown in 
Carlin and Louis (1996). 


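Example 7.15. (Example 7.11 continued.) Recall that, in this problem,
$(X \mid Y, \theta) \sim$ binomial$(Y, \theta)$, where $Y \mid \lambda \sim$ Poisson$(\lambda)$. Earlier, we worked with
a known mean $\lambda$, but let us now see if it is possible to handle this problem
with unknown $\lambda$. Because $Y$ is unobservable and only $X$ is observable, there
exists an 'identifiability' problem here, as can be seen by noting that
$X \mid \lambda, \theta \sim$ Poisson$(\lambda\theta)$. We already have the Beta$(\alpha, \gamma)$ prior on $\theta$. Suppose $0 < \alpha < 1$.
Consider an independent prior on $\lambda$ according to which $\pi(\lambda) \propto I(\lambda > 0)$.
Then,

$$\pi(\lambda, \theta \mid x) \propto \exp(-\lambda\theta)\,\lambda^{x}\,\theta^{x+\alpha-1}(1-\theta)^{\gamma-1}, \quad 0 < \theta < 1,\ \lambda > 0. \tag{7.31}$$

This joint density is improper because

$$\begin{aligned}
\int_0^1 \int_0^\infty \exp(-\lambda\theta)\,\lambda^{x}\,\theta^{x+\alpha-1}(1-\theta)^{\gamma-1}\, d\lambda\, d\theta
&= \int_0^1 \left(\int_0^\infty \exp(-\lambda\theta)\,\lambda^{x}\, d\lambda\right) \theta^{x+\alpha-1}(1-\theta)^{\gamma-1}\, d\theta \\
&= \int_0^1 \frac{\Gamma(x+1)}{\theta^{x+1}}\, \theta^{x+\alpha-1}(1-\theta)^{\gamma-1}\, d\theta \\
&= \Gamma(x+1) \int_0^1 \theta^{\alpha-2}(1-\theta)^{\gamma-1}\, d\theta \\
&= \infty.
\end{aligned}$$

In fact, the marginal distributions are also improper. However, it has full
conditional distributions that are proper:

$$\lambda \mid \theta, x \sim \text{Gamma}(x+1, \theta) \quad\text{and}\quad \pi(\theta \mid \lambda, x) \propto \exp(-\lambda\theta)\,\theta^{x+\alpha-1}(1-\theta)^{\gamma-1}.$$

Thus, for example, the Gibbs sampler can be successfully employed with these
proper full conditionals. To generate $\theta$ from $\pi(\theta \mid \lambda, x)$, one may use the
independent M-H algorithm described in Example 7.12. Any inference on the
marginal posterior distributions derived from this sample, however, will be
totally erroneous, whereas inferences can indeed be made on $\lambda\theta$.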
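
A small simulation makes the point concrete. The sketch below is an illustration under stated assumptions, not the scheme of Example 7.12: the Gamma conditional is read as having shape $x + 1$ and rate $\theta$, the $\theta$-update is a single independence M-H step with a Beta$(x + \alpha, \gamma)$ proposal (for which the acceptance ratio reduces to $\exp\{-\lambda(\theta' - \theta)\}$), and the values of `x`, `alpha`, `gamma` are hypothetical.

```python
import math
import random


def gibbs_improper(x, alpha, gamma, n_iter=5000, seed=2):
    """Run the sampler built from the proper full conditionals of (7.31)."""
    rng = random.Random(seed)
    theta = 0.5
    lam = rng.gammavariate(x + 1, 1.0 / theta)
    thetas, products = [], []
    for _ in range(n_iter):
        # theta | lambda, x: one independence M-H step, Beta(x + alpha, gamma) proposal;
        # target / proposal is proportional to exp(-lambda * theta).
        cand = rng.betavariate(x + alpha, gamma)
        if rng.random() < math.exp(min(0.0, -lam * (cand - theta))):
            theta = cand
        # lambda | theta, x ~ Gamma(shape x + 1, rate theta), i.e. scale 1 / theta
        lam = rng.gammavariate(x + 1, 1.0 / theta)
        thetas.append(theta)
        products.append(lam * theta)
    return thetas, products


thetas, products = gibbs_improper(x=5, alpha=0.5, gamma=2.0)
mean_product = sum(products) / len(products)
```

Given $\theta$, the product $\lambda\theta$ is Gamma with shape $x + 1$ and unit scale whatever the value of $\theta$, so `mean_product` settles near $x + 1$. Summaries of `thetas` alone, by contrast, have no proper distribution to estimate and should not be interpreted.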


In fact, the non-convergence of the chain encountered in the above example
is far from being uncommon. Often when we have a hierarchical prior, the prior
at the final stage of the hierarchy is an improper objective prior. Then it is not
easy to check that the joint posterior is proper. If it is not, none of the theorems on
convergence of the chains may apply, and yet the chain may seem to converge.
In such cases, inference based on MCMC may be misleading in the sense of
what was seen in the example above.


7.5 Exercises 


1. (Flury and Zoppé (2000)) A total of $m + n$ lightbulbs are tested in two
independent experiments. In the first experiment involving $n$ lightbulbs,
the exact lifetimes $y_1, \ldots, y_n$ of all the bulbs are recorded. In the second
involving $m$ lightbulbs, the only information available is whether these
lightbulbs were still burning at some fixed time $t > 0$. This is known as
right-censoring. Assume that the distribution of lifetime is exponential
with mean $1/\theta$, and use $\pi(\theta) \propto \theta^{-1}$. Find the posterior mode using the
E-M algorithm.

2. (Flury and Zoppé (2000)) In Problem 1, use uniform$(0, \theta)$ instead of
exponential for the lifetime distribution, and $\pi(\theta) = I_{(0,\infty)}(\theta)$. Show that
the E-M algorithm fails here if used to find the posterior mode.

3. (Inverse c.d.f. method) Show that, if the c.d.f. $F(x)$ of a random variable
$X$ is continuous and strictly increasing, then $U = F(X) \sim U[0, 1]$,
and if $V \sim U[0, 1]$, then $Y = F^{-1}(V)$ has c.d.f. $F$. Using this show that if
$U \sim U[0, 1]$, $-\ln U/\theta$ is an exponential random variable with mean $\theta^{-1}$.

4. (Box-Muller transformation method) Let $U_1$ and $U_2$ be a pair of
independent Uniform$(0, 1)$ random variables. Consider first a transformation to

$$W = R^2 = -2\log U_1; \quad V = 2\pi U_2,$$

and then let

$$X = R\cos V; \quad Y = R\sin V.$$

Show that $X$ and $Y$ are independent standard normal random variables.

5. Prove that the accept-reject Monte Carlo method given in Example 7.12
indeed generates samples from the target density. Further show that the
expected number of draws required from the 'proposal density' per
observation is $c^{-1}$.

6. Using the methods indicated in Exercises 3, 4, and 5 above, or combinations
thereof, prove that the standard continuous probability distributions
can be simulated.

7. Consider a discrete probability distribution that puts mass $p_i$ on point
$x_i$, $i = 0, 1, \ldots$. Let $U \sim U(0, 1)$, and define a new random variable $Y$ as
follows.

$$Y = \begin{cases} x_0 & \text{if } U \le p_0; \\ x_i & \text{if } \sum_{j=0}^{i-1} p_j < U \le \sum_{j=0}^{i} p_j, \; i \ge 1. \end{cases}$$

What is the probability distribution of $Y$?

8. Show that the random sequence generated by the independent M-H
algorithm is a Markov chain.

9. (Robert and Casella (1999)) Show that the Gamma distribution with
a non-integer shape parameter can be simulated using the accept-reject
method or the independent M-H algorithm.

10. Gibbs Sampling for Multinomial. Consider the ABO Blood Group
problem from Rao (1973). The observed counts in the four blood groups,
O, A, B, and AB, are as given in Table 7.3. Assuming that the inheritance
of these blood groups is controlled by three alleles, A, B, and O, of which
O is recessive to A and B, there are six genotypes OO, AO, AA, BO, BB,
and AB, but only four phenotypes. If $r$, $p$, and $q$ are the gene frequencies
of O, A, and B, respectively (with $p + q + r = 1$), then the probabilities
of the four phenotypes assuming Hardy-Weinberg equilibrium are also as
shown in Table 7.3. Thus we have here a 4-cell multinomial probability
vector that is a function of three parameters $p, q, r$ with $p + q + r = 1$.
One may wish to formulate a Dirichlet prior for $p, q, r$. But it will not
be conjugate to the 4-cell multinomial likelihood function in terms of
$p, q, r$ from the data, and this makes it difficult to work out the posterior
distribution of $p, q, r$. Although no data are missing in the real sense of
the term, it is profitable to split each of the $n_A$ and $n_B$ cells into two:
$n_A$ into $n_{AA}$, $n_{AO}$ with corresponding probabilities $p^2$, $2pr$, and $n_B$ into
$n_{BB}$, $n_{BO}$ with corresponding probabilities $q^2$, $2qr$, and consider the 6-cell
multinomial problem as a complete problem with $n_{AA}$, $n_{BB}$ as 'missing'
data.


Table 7.3. ABO Blood Group Data

Cell Count | Probability
$n_O$ | $r^2$
$n_A$ | $p^2 + 2pr$
$n_B$ | $q^2 + 2qr$
$n_{AB}$ | $2pq$


Let $N = n_O + n_A + n_B + n_{AB}$, and denote the observed data by
$n = (n_O, n_A, n_B, n_{AB})$. Consider estimation of $p, q, r$ using a Dirichlet prior
with parameters $\alpha, \beta, \gamma$ with the 'incomplete' observed data $n$.

The likelihood up to a multiplicative constant is

$$L(p, q, r) = r^{2n_O} (p^2 + 2pr)^{n_A} (q^2 + 2qr)^{n_B} (pq)^{n_{AB}}.$$


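The posterior density of $(p, q, r)$ given $n$ is proportional to

$$r^{2n_O + \gamma - 1} (p^2 + 2pr)^{n_A} (q^2 + 2qr)^{n_B}\, p^{n_{AB} + \alpha - 1}\, q^{n_{AB} + \beta - 1}.$$

Let $n_A = n_{AA} + n_{AO}$, $n_B = n_{BB} + n_{BO}$, and write $n_{OO}$ for $n_O$. Verify
that if we have the 'complete' data, $\tilde{n} = (n_{OO}, n_{AA}, n_{AO}, n_{BB}, n_{BO}, n_{AB})$,
then the likelihood is, up to a multiplicative constant,

$$r^{2n_{OO}}\, p^{2n_{AA}} (2pr)^{n_{AO}}\, q^{2n_{BB}} (2qr)^{n_{BO}} (pq)^{n_{AB}} \propto p^{2n_A^{+}}\, q^{2n_B^{+}}\, r^{2n_O^{+}},$$

where

$$n_A^{+} = n_{AA} + \tfrac{1}{2} n_{AB} + \tfrac{1}{2} n_{AO},$$

$$n_B^{+} = \tfrac{1}{2} n_{AB} + n_{BB} + \tfrac{1}{2} n_{BO},$$

$$n_O^{+} = \tfrac{1}{2} n_{AO} + \tfrac{1}{2} n_{BO} + n_{OO}.$$

Show that the posterior distribution of $(p, q, r)$ given $\tilde{n}$ is Dirichlet with
parameters $2n_A^{+} + \alpha$, $2n_B^{+} + \beta$, $2n_O^{+} + \gamma$, when the prior is Dirichlet
with parameters $(\alpha, \beta, \gamma)$.

Show that the conditional distribution of $(n_{AA}, n_{BB})$ given $n$ and $(p, q, r)$
is that of two independent binomials:

$$(n_{AA} \mid n, p, q, r) \sim \text{binomial}\left(n_A, \frac{p^2}{p^2 + 2pr}\right), \quad\text{and}$$

$$(n_{BB} \mid n, p, q, r) \sim \text{binomial}\left(n_B, \frac{q^2}{q^2 + 2qr}\right).$$

Further,

$$(p, q, r \mid n, n_{AA}, n_{BB}) \sim \text{Dirichlet}(2n_A^{+} + \alpha,\; 2n_B^{+} + \beta,\; 2n_O^{+} + \gamma).$$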
Show that the Rao-Blackwellized estimate of (p, q,r) from a Gibbs sample 
of size m is 


m 


1 | | l 
-P onii p tnh y +nie N), 
{=l 


where the superscript $i$ denotes the $i$th draw.

11. (M-H for the Weibull Model: Robert (2001)). The following twelve
observations are from a simulated reliability study:

0.56, 2.26, 1.90, 0.94, 1.40, 1.39, 1.00, 1.45, 2.32, 2.08, 0.89, 1.68.

A Weibull model with the following density form is considered appropriate:

$$f(x \mid \alpha, \eta) \propto \alpha \eta\, x^{\alpha - 1} e^{-\eta x^{\alpha}}, \quad 0 < x < \infty,$$

with parameters $(\alpha, \eta)$. Consider the prior distribution

$$\pi(\alpha, \eta) \propto e^{-\alpha}\, \eta^{\beta - 1} e^{-\xi \eta}.$$

The posterior distribution of $(\alpha, \eta)$ given the data $(x_1, x_2, \ldots, x_n)$ has
density

$$\pi(\alpha, \eta \mid x_1, x_2, \ldots, x_n) \propto (\alpha \eta)^n \left(\prod_{i=1}^{n} x_i\right)^{\alpha - 1} \exp\left(-\eta \sum_{i=1}^{n} x_i^{\alpha}\right) \pi(\alpha, \eta).$$

To get a sample from the posterior density, one may use the M-H algorithm
with proposal density

$$q(\alpha', \eta' \mid \alpha, \eta) = \frac{1}{\alpha \eta} \exp\left\{-\frac{\alpha'}{\alpha} - \frac{\eta'}{\eta}\right\},$$

which is a product of two independent exponential distributions with
means $\alpha$, $\eta$. Compute the acceptance probability $\rho((\alpha', \eta'), (\alpha^{(t)}, \eta^{(t)}))$
at the $t$th step of the M-H chain, and explain how the chain is to be
generated.

12. Complete the construction of the reversible jump M-H algorithm in
Example 7.14. In particular, choose an appropriate prior distribution, proposal
distribution and compute the acceptance probabilities.

13. (Carlin and Louis (1996)) Suppose $y_1, y_2, \ldots, y_n$ is an i.i.d. sample with

$$y_i \mid \theta_1, \theta_2 \sim N(\theta_1 + \theta_2, \sigma^2),$$

where $\sigma^2$ is assumed to be known. Independent improper uniform prior
distributions are assumed for $\theta_1$ and $\theta_2$.

(a) Show that the posterior density of $(\theta_1, \theta_2)$ given $y$ is

$$\pi(\theta_1, \theta_2 \mid y) \propto \exp(-n(\theta_1 + \theta_2 - \bar{y})^2/(2\sigma^2))\, I((\theta_1, \theta_2) \in R^2),$$

which is improper, integrating to $\infty$ (over $R^2$).

(b) Show that the marginal posterior distributions are also improper.

(c) Show that the full conditional distributions of this posterior distribution
are proper.

(d) Explain why a sample generated using the Gibbs sampler based on
these proper full conditionals will be totally useless for any inference on the
marginal posterior distributions, whereas inferences can indeed be made
on $\theta_1 + \theta_2$.

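14. Suppose $X_1, X_2, \ldots, X_k$ are independent Poisson counts with $X_i$ having
mean $\theta_i$. The $\theta_i$ are a priori considered related, but exchangeable, and the prior

$$\pi(\theta_1, \ldots, \theta_k) \propto \left(1 + \sum_{i=1}^{k} \theta_i\right)^{-(k+1)}$$
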
is assumed.
(a) Show that the prior is a conjugate mixture.
(b) Show how the Gibbs sampler can be employed for inference.

15. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with

$$X_i \mid \lambda_1, \lambda_2 \sim \text{exponential with mean } 1/(\lambda_1 \lambda_2),$$

and independent scale-invariant non-informative priors on $\lambda_1$ and $\lambda_2$ are
used, i.e., $\pi(\lambda_1, \lambda_2) \propto (\lambda_1 \lambda_2)^{-1} I(\lambda_1 > 0, \lambda_2 > 0)$.

(a) Show that the marginals of the posterior, $\pi(\lambda_1, \lambda_2 \mid x)$, are improper,
but the full conditionals are standard distributions.

(b) What posterior inferences are possible based on a sample generated
from the Gibbs sampler using these full conditionals?


Some Common Problems in Inference


We have already discussed some basic inference problems in the previous 
chapters. These include the problems involving the normal mean and the 
binomial proportion. Some other usually encountered problems are discussed 
in what follows. 


8.1 Comparing Two Normal Means 





Investigating the difference between two mean values or two proportions is a 
frequently encountered problem. Examples include agricultural experiments 
where two different varieties of seeds or fertilizers are employed, or clinical 
trials involving two different treatments. Comparison of two binomial propor- 
tions was considered in Example 4.6 and Problem 8 in Chapter 4. Comparison 
of two normal means is discussed below. 

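Suppose the model for the available data is as follows. $Y_{11}, \ldots, Y_{1n_1}$ is
a random sample of size $n_1$ from a normal population, $N(\theta_1, \sigma_1^2)$, whereas
$Y_{21}, \ldots, Y_{2n_2}$ is an independent random sample of size $n_2$ from another normal
population, $N(\theta_2, \sigma_2^2)$. All the four parameters $\theta_1$, $\theta_2$, $\sigma_1^2$, and $\sigma_2^2$ are unknown,
but the quantity of inferential interest is $\eta = \theta_1 - \theta_2$.

It is convenient to consider the case $\sigma_1^2 = \sigma_2^2 = \sigma^2$ separately. In this case,
$(\bar{Y}_1, \bar{Y}_2, s^2)$ is jointly sufficient for $(\theta_1, \theta_2, \sigma^2)$, where
$s^2 = \left(\sum_{i=1}^{n_1} (Y_{1i} - \bar{Y}_1)^2 + \sum_{j=1}^{n_2} (Y_{2j} - \bar{Y}_2)^2\right)/(n_1 + n_2 - 2)$. Further, given $(\theta_1, \theta_2, \sigma^2)$,

$$\bar{Y}_1 \sim N\left(\theta_1, \frac{\sigma^2}{n_1}\right), \quad \bar{Y}_2 \sim N\left(\theta_2, \frac{\sigma^2}{n_2}\right), \quad\text{and}\quad (n_1 + n_2 - 2)s^2 \sim \sigma^2 \chi^2_{n_1+n_2-2},$$

and they are independently distributed. Upon utilizing the objective prior
$\pi(\theta_1, \theta_2, \sigma^2) \propto \sigma^{-2}$, one obtains

$$\pi(\theta_1, \theta_2, \sigma^2 \mid \text{data}) = \pi(\theta_1 \mid \sigma^2, \bar{y}_1)\,\pi(\theta_2 \mid \sigma^2, \bar{y}_2)\,\pi(\sigma^2 \mid s^2),$$

and hence

$$\pi(\eta, \sigma^2 \mid \text{data}) = \pi(\eta \mid \sigma^2, \bar{y}_1, \bar{y}_2)\,\pi(\sigma^2 \mid s^2). \tag{8.1}$$

Now, note that

$$\eta \mid \sigma^2, \bar{y}_1, \bar{y}_2 \sim N\left(\bar{y}_1 - \bar{y}_2,\; \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right).$$

Consequently, integrating out $\sigma^2$ from (8.1) yields

$$\pi(\eta \mid \text{data}) \propto \left[1 + \frac{(\eta - (\bar{y}_1 - \bar{y}_2))^2}{(n_1 + n_2 - 2)\,s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}\right]^{-(n_1+n_2-1)/2},$$

or, equivalently,

$$\frac{\eta - (\bar{y}_1 - \bar{y}_2)}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \,\Big|\, \text{data} \sim t_{n_1+n_2-2}.$$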
S 


In many situations, the assumption that $\sigma_1^2 = \sigma_2^2$ is not tenable. For
example, in a clinical trial the populations corresponding with two different
treatments may have very different spread. This problem of comparing means
when we have unequal and unknown variances is known as the Behrens-Fisher
problem, and a frequentist approach to this problem has already been
discussed in Problem 17, Chapter 2. We discuss the Bayesian approach now. We
have that $(\bar{Y}_1, s_1^2)$ is sufficient for $(\theta_1, \sigma_1^2)$ and $(\bar{Y}_2, s_2^2)$ is sufficient for $(\theta_2, \sigma_2^2)$,
where $s_i^2 = \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2/(n_i - 1)$, $i = 1, 2$. Also, given $(\theta_1, \theta_2, \sigma_1^2, \sigma_2^2)$,

$$\bar{Y}_i \sim N\left(\theta_i, \frac{\sigma_i^2}{n_i}\right) \quad\text{and}\quad (n_i - 1)s_i^2 \sim \sigma_i^2 \chi^2_{n_i - 1}, \quad i = 1, 2,$$

and further, they are all independently distributed. Now employ the objective
prior

$$\pi(\theta_1, \theta_2, \sigma_1^2, \sigma_2^2) \propto \sigma_1^{-2} \sigma_2^{-2},$$

and proceed exactly as in the previous case. It then follows that under the
posterior distribution also $\theta_1$ and $\theta_2$ are independent, and that

$$\frac{\sqrt{n_1}(\theta_1 - \bar{y}_1)}{s_1} \,\Big|\, \text{data} \sim t_{n_1 - 1} \quad\text{and}\quad \frac{\sqrt{n_2}(\theta_2 - \bar{y}_2)}{s_2} \,\Big|\, \text{data} \sim t_{n_2 - 1}. \tag{8.3}$$

It may be immediately noted that the posterior distribution of $\eta = \theta_1 - \theta_2$,
however, is not a standard distribution. Posterior computations are still quite
easy to perform because Monte Carlo sampling is totally straightforward.
Simply generate independent deviates $\theta_1$ and $\theta_2$ repeatedly from (8.3) and utilize
the corresponding $\eta = \theta_1 - \theta_2$ values to investigate its posterior distribution.
Problem 4 is expected to apply these results.
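
The Monte Carlo recipe just described can be sketched as follows. The summary statistics passed in at the bottom are hypothetical values chosen only to exercise the code, and the helper names are not from the text; a Student $t$ deviate is built from a standard normal and an independent chi-square.

```python
import math
import random


def t_draw(rng, df):
    """One Student t deviate: N(0, 1) divided by sqrt(chi-square_df / df)."""
    chi2 = rng.gammavariate(df / 2, 2.0)      # chi-square with df degrees of freedom
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)


def behrens_fisher_draws(ybar1, s1, n1, ybar2, s2, n2, m=20000, seed=3):
    """Posterior draws of eta = theta_1 - theta_2 using the two t posteriors in (8.3)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(m):
        theta1 = ybar1 + s1 / math.sqrt(n1) * t_draw(rng, n1 - 1)
        theta2 = ybar2 + s2 / math.sqrt(n2) * t_draw(rng, n2 - 1)
        draws.append(theta1 - theta2)
    return sorted(draws)


# Hypothetical summaries: (mean, standard deviation, sample size) for each group.
eta = behrens_fisher_draws(10.2, 2.0, 12, 8.1, 4.5, 9)
eta_mean = sum(eta) / len(eta)
eta_interval = (eta[int(0.025 * len(eta))], eta[int(0.975 * len(eta)) - 1])
```

Here `eta_interval` is an equal-tailed 95% credible interval read off the sorted draws; any other posterior summary of $\eta$ can be computed from the same sample.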

Extension to the k-mean problem or one-way ANOVA is straightforward. A
hierarchical Bayes approach to this problem and implementation using MCMC
have already been discussed in Example 7.13 in Chapter 7.


8.2 Linear Regression


We encountered normal linear regression in Section 5.4 where we discussed 
prior elicitation issues in the context of the problem of inference on a response 
variable Y conditional on some predictor variable X. Normal linear models 
in general, and regression models in particular are very widely used. We have 
already seen an illustration of this in Example 7.13 in Section 7.4.6. We intend 
to cover some of the important inference problems related to regression in this 
section. 
Extending the simple linear regression model where $E(Y \mid \beta_0, \beta_1, X = x) =
\beta_0 + \beta_1 x$ to the multiple linear regression case, $E(Y \mid \beta, X = x) = \beta'x$, yields
the linear model

$$y = X\beta + \epsilon, \tag{8.4}$$

where $y$ is the $n$-vector of observations, $X$ the $n \times p$ matrix having the
appropriate readings from the predictors, $\beta$ the $p$-vector of unknown regression
coefficients, and $\epsilon$ the $n$-vector of random errors with mean 0 and constant
variance $\sigma^2$. The parameter vector then is $(\beta, \sigma^2)$, and most often the
statistical inference problem involves estimation of $\beta$ and also testing hypotheses
involving the same parameter vector. For convenience, we assume that $X$ has
full column rank $p < n$. We also assume that the first column of $X$ is the
vector of 1's, so that the first element of $\beta$, namely $\beta_1$, is the intercept.

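If we assume that the random errors are independent normals, we obtain
the likelihood function for $(\beta, \sigma^2)$ as

$$\begin{aligned}
f(y \mid \beta, \sigma^2) &= \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right\} \\
&= \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left\{-\frac{1}{2\sigma^2}\left[(y - \hat{y})'(y - \hat{y}) + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta})\right]\right\},
\end{aligned} \tag{8.5}$$

where

$$\hat{\beta} = (X'X)^{-1}X'y, \quad\text{and}\quad \hat{y} = X\hat{\beta}.$$

It then follows that $\hat{\beta}$ is sufficient for $\beta$ if $\sigma^2$ is known, and $(\hat{\beta}, (y - \hat{y})'(y - \hat{y}))$
is jointly sufficient for $(\beta, \sigma^2)$. Further,

$$\hat{\beta} \mid \beta, \sigma^2 \sim N_p(\beta, \sigma^2 (X'X)^{-1})$$

and is independent of

$$(y - \hat{y})'(y - \hat{y}) \mid \sigma^2 \sim \sigma^2 \chi^2_{n-p}.$$

We take the prior,

$$\pi(\beta, \sigma^2) \propto \frac{1}{\sigma^2}. \tag{8.6}$$

This leads to the posterior,

$$\pi(\beta, \sigma^2 \mid y) = \pi(\beta \mid \hat{\beta}, \sigma^2)\, \pi(\sigma^2 \mid (y - \hat{y})'(y - \hat{y})). \tag{8.7}$$

It can be seen that

$$\beta \mid \hat{\beta}, \sigma^2 \sim N_p(\hat{\beta}, \sigma^2 (X'X)^{-1})$$

and that the posterior distribution of $\sigma^2$ is proportional to an inverse $\chi^2_{n-p}$.
Integrating out $\sigma^2$ from this joint posterior density yields the multivariate t
marginal posterior density for $\beta$, i.e.,

$$\pi(\beta \mid y) = \frac{\Gamma(n/2)\, |X'X|^{1/2}\, s^{-p}}{(\Gamma(1/2))^{p}\, \Gamma((n-p)/2)\, (\sqrt{n-p})^{p}} \left[1 + \frac{(\beta - \hat{\beta})'X'X(\beta - \hat{\beta})}{(n-p)s^2}\right]^{-n/2},$$

where $s^2 = (y - \hat{y})'(y - \hat{y})/(n - p)$. From this, it can be deduced that the
posterior mean of $\beta$ is $\hat{\beta}$ if $n > p + 2$, and the $100(1-\alpha)\%$ HPD credible region
for $\beta$ is given by the ellipsoid

$$\left\{\beta : (\beta - \hat{\beta})'X'X(\beta - \hat{\beta}) \le p s^2 F_{p,n-p}(\alpha)\right\}, \tag{8.9}$$

where $F_{p,n-p}(\alpha)$ is the $(1 - \alpha)$ quantile of the $F_{p,n-p}$ distribution. Further,
if one is interested in a particular $\beta_j$, the fact that the marginal posterior
distribution of $\beta_j$ is given by

$$\frac{\beta_j - \hat{\beta}_j}{s\sqrt{d_{jj}}} \,\Big|\, y \sim t_{n-p}, \tag{8.10}$$

where $d_{jj}$ is the $j$th diagonal entry of $(X'X)^{-1}$, can be used.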

Conjugate priors for the normal regression model are of interest espe- 
cially if hierarchical prior modeling is desired. This discussion, however, will 
be deferred to the following chapters where hierarchical Bayesian analysis is 
discussed. 


Example 8.1. Table 8.1 shows the maximum January temperatures (in degrees 
Fahrenheit), from 1931 to 1960, for 62 cities in the U.S., along with their lat- 
itude (degrees), longitude (degrees) and altitude (feet). (See Mosteller and 
Tukey, 1977.) It is of interest to relate the information supplied by the geo- 
graphical coordinates to the maximum January temperatures. 

The following summary measures are obtained.

$$X'X = \begin{pmatrix}
62.0 & 2365.0 & 5674.0 & 56012.0 \\
2365.0 & 92955.0 & 217285.0 & 2244586.0 \\
5674.0 & 217285.0 & 538752.0 & 5685654.0 \\
56012.0 & 2244586.0 & 5685654.0 & 1.7720873 \times 10^{8}
\end{pmatrix},$$

Table 8.1. Maximum January Temperatures for U.S. Cities, with Latitude, Longitude, and Altitude

City Latitude Longitude Altitude Max. Jan. Temp
Mobile, Ala. 30 88 5 61 
Montgomery, Ala. 32 86 160 59 
Juneau, Alaska 58 134 50 30 
Phoenix, Ariz. 33 112 1090 64 
Little Rock, Ark. 34 92 286 51 
Los Angeles, Calif. 34 118 340 65 
San Francisco, Calif. 37 122 65 55 
Denver, Col. 39 104 5280 42 
New Haven, Conn. 41 72 40 37
Wilmington, Del. 39 79 135 41 
Washington, D.C. 38 te 25 44 
Jacksonville, Fla. 38 81 20 67 
Key West, Fla. 24 81 5 74 
Miami, Fla. 25 80 10 76 
Atlanta, Ga. 33 84 1050 52 
Honolulu, Hawaii 21 157 21 79 
Boise, Idaho 43 116 2704 36 
Chicago, Ill. 41 87 595 33 
Indianapolis, Ind. 39 86 710 37
Des Moines, Iowa 41 93 805 29 
Dubuque, Iowa 42 90 620 27 
Wichita, Kansas 37 97 1290 42 
Louisville, Ky. 38 85 450 44 
New Orleans, La. 29 90 5 64 
Portland, Maine 43 70 25 32 
Baltimore, Md. 39 76 20 44 
Boston, Mass. 42 71 21 37 
Detroit, Mich. 42 83 585 33 
Sault Sainte Marie, Mich. 46 84 650 23 
Minneapolis -St. Paul, Minn. 44 93 815 22 
St. Louis, Missouri 38 90 455 40 
Helena, Montana 46 112 4155 29 
Omaha, Neb. 41 95 1040 32
Concord, N.H. 43 71 290 32 
Atlantic City, N.J. 39 74 10 43 
Albuquerque, N.M. 35 106 4945 46 
Albany, N.Y. 42 73 20 31
New York, N.Y. 40 73 55 40
Charlotte, N.C. 35 80 720 51
Raleigh, N.C. 35 78 365 52
Bismark, N.D. 46 100 1674 20
Cincinnati, Ohio 39 84 550 41
Cleveland, Ohio 41 81 660 35
Oklahoma City, Okla. 35 97 1195 46
Portland, Ore. 45 122 77 44
Harrisburg, Pa. 40 76 365 39
Philadelphia, Pa. 39 75 100 40
Charlestown, S.C. 32 79 9 61
Rapid City, S.D. 44 103 3230 34
Nashville, Tenn. 36 86 450 49
Amarillo, Tx. 35 101 3685 50
Galveston, Tx. 29 94 5 61
Houston, Tx. 29 95 40 64
Salt Lake City, Utah 40 111 4390 37
Burlington, Vt. 44 73 110 25
Norfolk, Va. 36 76 10 50
Seattle-Tacoma, Wash. 47 122 10 44
Spokane, Wash. 47 117 1890 31
Madison, Wisc. 43 89 860 26
Milwaukee, Wisc. 43 87 635 28
Cheyenne, Wyoming 41 104 6100 37
San Juan, Puerto Rico 18 66 35 81

$$(X'X)^{-1} = 10^{-5} \begin{pmatrix}
94883.1914 & -1342.5011 & -485.0209 & 2.5756 \\
-1342.5011 & 37.8582 & -0.8276 & -0.0286 \\
-485.0209 & -0.8276 & 5.8951 & -0.0254 \\
2.5756 & -0.0286 & -0.0254 & 0.0009
\end{pmatrix},$$

$$X'y = \begin{pmatrix} 2739.0 \\ 99168.0 \\ 259007.0 \\ 2158463.0 \end{pmatrix}, \quad
\hat{\beta} = \begin{pmatrix} 100.8260 \\ -1.9315 \\ 0.2033 \\ -0.0017 \end{pmatrix}, \quad\text{and}\quad s = 6.05185.$$



On the basis of the analysis explained above, $\hat{\beta}$ may be taken as the
estimate of $\beta$ and the HPD credible region for it can be derived using (8.9).
Suppose instead one is interested in the impact of latitude on maximum
January temperatures. Then the 95% HPD region for the corresponding
regression coefficient $\beta_2$ can be obtained using (8.10). This yields the t-interval,
$\hat{\beta}_2 \pm s\sqrt{d_{22}}\, t_{58}(.975)$, or (-2.1623, -1.7007), indicating an expected general drop
in maximum temperatures as one moves away from the Equator. If the joint
impact of latitude and altitude is also of interest, then one would look at the
HPD credible set for $(\beta_2, \beta_4)$. This is given by

$$\left\{(\beta_2, \beta_4) : (\beta_2 + 1.9315,\ \beta_4 + 0.0017)\, C^{-1} (\beta_2 + 1.9315,\ \beta_4 + 0.0017)' \le 2 s^2 F_{2,58}(\alpha)\right\},$$

where $C$ is the appropriate $2 \times 2$ block from $(X'X)^{-1}$,

$$C = 10^{-4} \begin{pmatrix} 3.7858 & -2.8636 \times 10^{-3} \\ -2.8636 \times 10^{-3} & 9.2635 \times 10^{-5} \end{pmatrix}.$$

Plotted in Figure 8.1 are the 95% and 99% HPD credible regions for
$(\beta_2, \beta_4)$. Impact of altitude on maximum temperatures seems to be very
limited for the case under consideration.

Fig. 8.1. Plot of 95% and 99% HPD credible regions for $(\beta_2, \beta_4)$.
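
The marginal posterior of a single coefficient can also be explored by simulation, drawing $\sigma^2$ from its scaled inverse chi-square posterior and then the coefficient from its conditional normal posterior. The sketch below does this for the latitude coefficient using the summaries reported in this example ($\hat{\beta}_2 = -1.9315$, $s = 6.05185$, $d_{22} = 37.8582 \times 10^{-5}$, $n - p = 58$); the function names are hypothetical. The simulated endpoints should come out close to the interval quoted above, though not identical: the draws carry Monte Carlo error, and the quoted endpoints correspond to a multiplier of about 1.96 on $s\sqrt{d_{22}}$.

```python
import math
import random


def coefficient_draws(beta_hat_j, s, d_jj, df, m=20000, seed=4):
    """Draws from the marginal posterior of one regression coefficient."""
    rng = random.Random(seed)
    draws = []
    for _ in range(m):
        chi2 = rng.gammavariate(df / 2, 2.0)     # chi-square with df degrees of freedom
        sigma = s * math.sqrt(df / chi2)         # sigma^2 | y is df * s^2 / chi-square_df
        # beta_j | sigma^2, y is N(beta_hat_j, sigma^2 * d_jj)
        draws.append(rng.gauss(beta_hat_j, sigma * math.sqrt(d_jj)))
    return sorted(draws)


def central_interval(sorted_draws, level=0.95):
    """Equal-tailed interval from sorted draws."""
    m = len(sorted_draws)
    lo = sorted_draws[int(m * (1 - level) / 2)]
    hi = sorted_draws[int(m * (1 + level) / 2) - 1]
    return lo, hi


draws = coefficient_draws(beta_hat_j=-1.9315, s=6.05185, d_jj=37.8582e-5, df=58)
lo, hi = central_interval(draws)
```

Since the marginal posterior is a symmetric, unimodal $t$ distribution, the equal-tailed interval `(lo, hi)` approximates the HPD interval.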


Literature on Bayesian approach to linear regression is very large. Some 
of this material relevant to the discussion given above may be found in Box 
and Tiao (1973), Leamer (1978), and Gelman et al. (1995). 


8.3 Logit Model, Probit Model, and Logistic Regression 


We consider a problem here that is related to linear regression but actually 
belongs to a broad class of generalized linear models. This model is useful 
for problems involving toxicity tests and bioassay experiments. In such ex- 
periments, usually various dose levels of drugs are administered to batches of 
animals. Most often the responses are dichotomous because what is observed 
is whether the subject is dead or whether a tumor has appeared. This leads to 
a setup that can be easily understood in the context of the following example.

Example 8.2. Suppose that $k$ independent random variables $y_1, y_2, \ldots, y_k$ are
observed, where $y_i$ has the $B(n_i, p_i)$ probability distribution, $1 \le i \le k$. Here $y_i$
may be the number of laboratory animals cured of an ailment in an experiment
involving $n_i$ such animals. It is certainly possible to make inference on each $p_i$
separately based on the observed $y_i$ (as discussed previously). This, however,
is not really useful if we want to predict the results of a similar experiment in
future. Suppose that the $p_i$ are related to a covariate or an explanatory
variable, such as dosage level in a clinical experiment. Then the natural approach
is regression as described in the previous problem, because this allows us to
explore and present the relationship between design (explanatory) variables
and response variables, and (if needed) predictions of response at desired levels
of the explanatory variables. Let $t_i$ be the value of the covariate that
corresponds with $p_i$, $i = 1, 2, \ldots, k$. Because the $p_i$'s are probabilities, linking them
to the corresponding $t_i$'s through a linear map as was done earlier does not
seem appropriate now. Instead it can be made through a link function $H$ such
that $p_i = H(\beta_0 + \beta_1 t_i)$. $H$, here, is a known cumulative distribution function
(c.d.f.) and $\beta_0$ and $\beta_1$ are two unknown parameters. (If $H$ is an invertible
function, this is precisely $H^{-1}(p_i) = \beta_0 + \beta_1 t_i$.) If the standard normal c.d.f.
is used for $H$, the model is called the probit model, and if the logistic c.d.f.
(i.e., $H(z) = e^{-z}/(1 + e^{-z})$) is used, it is called the logit model. The likelihood
function for the unknown parameters, $\beta_0$ and $\beta_1$, is then given by

$$\prod_{i=1}^{k} \binom{n_i}{y_i} H(\beta_0 + \beta_1 t_i)^{y_i} (1 - H(\beta_0 + \beta_1 t_i))^{n_i - y_i}.$$

Suppose $\pi(\beta_0, \beta_1)$ is the prior density on $(\beta_0, \beta_1)$ so that the posterior density
is

$$\pi(\beta_0, \beta_1 \mid \text{data}) = \frac{\pi(\beta_0, \beta_1) \prod_{i=1}^{k} H(\beta_0 + \beta_1 t_i)^{y_i} (1 - H(\beta_0 + \beta_1 t_i))^{n_i - y_i}}{\int\!\!\int \pi(a, b) \prod_{i=1}^{k} H(a + b t_i)^{y_i} (1 - H(a + b t_i))^{n_i - y_i}\, da\, db}.$$

It may be noted that the sample size $n_i$ and the covariates $t_i$ (dose level) are
treated here as fixed, or equivalently the model conditional on those variables
is analyzed (as in the linear regression problem).


More generally, instead of a single covariate $t$, if we have a set of $s$ covariates
represented by $x$ and the corresponding vector of coefficients $\beta$, the posterior
density of $\beta$ will be

$$\pi(\beta \mid \text{data}) \propto \pi(\beta) \prod_{i=1}^{k} H(\beta'x_i)^{y_i} (1 - H(\beta'x_i))^{n_i - y_i}. \tag{8.11}$$


8.3.1 The Logit Model


If we use the logit model whereby p; = exp{—(’x;}/{1 + exp{—’x;}}, and 
hence — log(p;/(1 — pi)) = B’x:, we obtain the likelihood function 
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exp{—Q’x;} \” / —(ni—yi) 
(8) x |] (ee) (1 + exp{—S'x;}) 
i=1 ý 
k k 
Ba (05 va) [IO +ex(-6'x))™ (8.12) 
i=l 


i=1 


which is largely intractable (but see Problem 7). Usually a flat prior such as
$\pi(\boldsymbol{\beta}) \propto 1$ is employed, but because $\boldsymbol{\beta}$ can be considered to be regression
coefficients under reparameterization, an approximate conjugate normal prior can
also be used. In such a case, a hierarchical prior structure is also meaningful.
To motivate the approximate conjugate hierarchical prior structure, consider
the following large sample approximation. For simplicity, let there be
only one covariate $t$. Assume that the $n_i$ are large enough for a satisfactory
normal approximation of the binomial model. If we let $\hat{p}_i = y_i/n_i$, then
(approximately) these $\hat{p}_i$ are independent $N(p_i, p_i(1 - p_i)/n_i)$ random variates.
Now let $\theta_i = -\log(p_i/(1 - p_i))$ and $\hat{\theta}_i = -\log(\hat{p}_i/(1 - \hat{p}_i))$. It can be shown
that, approximately, $(\hat{\theta}_i - \theta_i)\sqrt{n_i \hat{p}_i(1 - \hat{p}_i)}$ are independent $N(0,1)$ random
variates. Because $\theta_i = \beta_0 + \beta_1 t_i$ under the logit model, the likelihood function for
$(\beta_0, \beta_1)$ is then (again approximately)

$$l(\beta_0, \beta_1 \mid \text{data}) = \exp\Big(-\frac{1}{2} \sum_{i=1}^{k} w_i \{\hat{\theta}_i - (\beta_0 + \beta_1 t_i)\}^2\Big), \tag{8.13}$$

where $w_i = n_i \hat{p}_i(1 - \hat{p}_i)$ are known weights. This suggests a bivariate normal
prior on $(\beta_0, \beta_1)$ as the first level in the hierarchical structure. Now the problem
is very similar to that discussed in Section 5.4. Further, there is also another
important use for the approximate likelihood in (8.13). Its product with
the conjugate normal prior discussed above can be used as a natural proposal
density for the M-H algorithm in the computations required for inferential
purposes (see Problem 9). If instead a flat prior on $\boldsymbol{\beta}$ is to be employed, then
(8.13) itself (up to a constant) can be used as the proposal density.
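
The flat-prior scheme just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the computation behind the examples in this chapter: the data are hypothetical (they are not the entries of Table 8.2), the function names are ours, and the chain is an independence M-H sampler whose proposal is the bivariate normal implied by (8.13), that is, the weighted least squares fit of $\hat{\theta}_i$ on $t_i$ with weights $w_i$.

```python
import math
import random

# Hypothetical grouped dose-response data (NOT the values in Table 8.2).
t = [0.4, 0.6, 0.7, 0.9, 1.0]   # dose on a log scale
n = [50, 50, 50, 50, 50]        # batch sizes
y = [5, 15, 25, 40, 45]         # deaths, each strictly between 0 and n_i

def log_post(b0, b1):
    """Log of (8.12) under a flat prior, with p_i = e^{-eta_i} / (1 + e^{-eta_i})."""
    total = 0.0
    for ti, ni, yi in zip(t, n, y):
        eta = b0 + b1 * ti
        total += -yi * eta - ni * math.log1p(math.exp(-eta))
    return total

# Weighted least squares fit behind (8.13): theta_hat_i regressed on t_i.
theta_hat = [-math.log((yi / ni) / (1 - yi / ni)) for yi, ni in zip(y, n)]
w = [ni * (yi / ni) * (1 - yi / ni) for yi, ni in zip(y, n)]
sw = sum(w)
swt = sum(wi * ti for wi, ti in zip(w, t))
swtt = sum(wi * ti * ti for wi, ti in zip(w, t))
swy = sum(wi * th for wi, th in zip(w, theta_hat))
swty = sum(wi * ti * th for wi, ti, th in zip(w, t, theta_hat))
det = sw * swtt - swt ** 2
m0 = (swtt * swy - swt * swty) / det               # proposal mean for beta_0
m1 = (sw * swty - swt * swy) / det                 # proposal mean for beta_1
c00, c01, c11 = swtt / det, -swt / det, sw / det   # proposal covariance
l11 = math.sqrt(c00)                               # Cholesky factor of the covariance
l21 = c01 / l11
l22 = math.sqrt(c11 - l21 ** 2)

def log_q(b0, b1):
    """Log of the normal proposal density from (8.13), up to an additive constant."""
    d0, d1 = b0 - m0, b1 - m1
    return -0.5 * (sw * d0 * d0 + 2 * swt * d0 * d1 + swtt * d1 * d1)

def independence_mh(n_iter, seed=1):
    """Independence M-H: every proposal is drawn from the fixed normal density."""
    rng = random.Random(seed)
    cur = (m0, m1)
    cur_lw = log_post(*cur) - log_q(*cur)          # log of target / proposal
    draws, accepted = [], 0
    for _ in range(n_iter):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        prop = (m0 + l11 * z1, m1 + l21 * z1 + l22 * z2)
        prop_lw = log_post(*prop) - log_q(*prop)
        if rng.random() < math.exp(min(0.0, prop_lw - cur_lw)):
            cur, cur_lw = prop, prop_lw
            accepted += 1
        draws.append(cur)
    return draws, accepted / n_iter

draws, acc_rate = independence_mh(2000)
post_mean = [sum(d[j] for d in draws) / len(draws) for j in (0, 1)]
```

Because the proposal does not depend on the current state, the acceptance ratio reduces to a ratio of importance weights; a low acceptance rate would signal that the normal approximation (8.13) is poor for the data at hand.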


Example 8.3. (Example 8.2 continued). The data given in Table 8.2 is from 
Finney (1971) (which originally appeared in Martin (1942)) where results of 
spraying rotenone of different concentrations on the chrysanthemum aphids 
in batches of about fifty are presented. The concentration is in milligrams per 
liter of the solution and the dosage x is measured on the logarithmic scale. 
The median lethal dose LD50, the dose at which 50% of the subjects will 
expire, is one of the quantities of inferential interest. 

The plot of the observed proportion $\hat{p} = y/n$ against $x$ shown in Figure 8.2 is S-shaped, so a linear
fit for $p$ based on $x$ is unsatisfactory. Instead, as suggested by Figure 8.3,
the logistic regression is more appropriate here. Suppose that a flat prior on
$(\beta_0, \beta_1)$ is to be used. Then the implementation of the M-H algorithm as explained
above using the bivariate normal proposal density is straightforward. A scatter
plot of 1000 values of $(\beta_0, \beta_1)$ so obtained is shown in Figure 8.4.

Fig. 8.2. Plot of proportion of deaths against dosage level.

Fig. 8.3. Plot of logistic response function.

Table 8.2. Toxicity of Rotenone

Concentration (c_i) | Dose (x_i = log(c_i)) | Batch Size (n_i) | Deaths (y_i)
2.6                 | ...                   | 50               | 6
3.8                 | ...                   | 48               | 16
5.1                 | ...                   | 46               | 24
...                 | ...                   | 49               | ...
...                 | ...                   | 50               | ...








Fig. 8.4. Scatter plot of 1000 $(\beta_0, \beta_1)$ values sampled using M-H algorithm.

A comparison of the estimates of $\beta_0$ and $\beta_1$ obtained using MLE and
posterior means is shown in Table 8.3.

Histograms of the M-H samples of $\beta_0$ and $\beta_1$ are shown in Figure 8.5
and Figure 8.6, respectively. They seem to be skewed and hence the posterior
estimates seem more appropriate.

Table 8.3. Estimates of $\beta_0$ and $\beta_1$ from Different Methods

Method                       | beta_0 | s.e.   | beta_1 | s.e.
MLE from logistic regression | 4.826  | -      | -7.065 | -
Posterior estimates          | 4.9727 | 0.6312 | -7.266 | 0.8859

Fig. 8.5. Histogram of 1000 $\beta_0$ values sampled using M-H algorithm.

Fig. 8.6. Histogram of 1000 $\beta_1$ values sampled using M-H algorithm.

Let us consider the estimation of LD50 next. Note that for the logit
model LD50 is that dosage level $t_0$ for which
$E(y_i/n_i \mid t_i = t_0) = \exp(-(\beta_0 + \beta_1 t_0))/(1 + \exp(-(\beta_0 + \beta_1 t_0))) = 0.5$.
This means that LD50 $= -\beta_0/\beta_1$. We
can easily estimate this using our M-H sample, and we obtain an estimate of
0.6843 (in the logarithmic scale) with a standard error of 0.022.
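
As a small illustration of how such an estimate is read off a posterior sample, the sketch below maps each draw of $(\beta_0, \beta_1)$ to $-\beta_0/\beta_1$ and summarizes the resulting values. The draws listed are hypothetical placeholders, not output of the sampler used for Example 8.3, and the function name is ours.

```python
import statistics

def ld50_summary(draws):
    """Posterior mean and standard deviation of LD50 = -beta0/beta1 from (beta0, beta1) draws."""
    ld50 = [-b0 / b1 for b0, b1 in draws]
    return statistics.mean(ld50), statistics.stdev(ld50)

# Hypothetical draws, for illustration only.
demo_draws = [(4.8, -7.0), (5.0, -7.3), (5.2, -7.4), (4.6, -6.9)]
est, se = ld50_summary(demo_draws)
```

Transforming each draw before averaging gives the posterior mean of the ratio itself, which in general differs from the ratio of the posterior means of $\beta_0$ and $\beta_1$.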


8.3.2 The Probit Model 


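As mentioned previously, if the standard normal c.d.f. $\Phi$ is used for the link
function $H$ above, we obtain the probit model. Then, assuming that $\pi(\boldsymbol{\beta})$ is
the prior density on $\boldsymbol{\beta}$, the posterior density is obtained as

$$\pi(\boldsymbol{\beta} \mid \text{data}) \propto \pi(\boldsymbol{\beta}) \prod_{i=1}^{k} \Phi(\boldsymbol{\beta}'\mathbf{x}_i)^{y_i} \left(1 - \Phi(\boldsymbol{\beta}'\mathbf{x}_i)\right)^{n_i - y_i}. \tag{8.14}$$

Analytically, this is even less tractable than the posterior density for the logit
model. However, the following computational scheme developed by Albert and
Chib (1993) based on the Gibbs sampler provides a convenient strategy.
First consider the simpler case involving Bernoulli $y_i$'s, i.e., $n_i = 1$. Then,

$$\pi(\boldsymbol{\beta} \mid \text{data}) \propto \pi(\boldsymbol{\beta}) \prod_{i=1}^{k} \Phi(\boldsymbol{\beta}'\mathbf{x}_i)^{y_i} \left(1 - \Phi(\boldsymbol{\beta}'\mathbf{x}_i)\right)^{1 - y_i}.$$

The computational scheme then proceeds by introducing $k$ independent latent
variables $Z_1, Z_2, \ldots, Z_k$, where $Z_i \sim N(\boldsymbol{\beta}'\mathbf{x}_i, 1)$. If we let $Y_i = I(Z_i > 0)$,
then $Y_1, \ldots, Y_k$ are independent Bernoulli with $p_i = P(Y_i = 1) = \Phi(\boldsymbol{\beta}'\mathbf{x}_i)$.
Now note that the posterior density of $\boldsymbol{\beta}$ and $\mathbf{Z} = (Z_1, \ldots, Z_k)$ given
$\mathbf{y} = (y_1, \ldots, y_k)$ is

$$\pi(\boldsymbol{\beta}, \mathbf{Z} \mid \mathbf{y}) \propto \pi(\boldsymbol{\beta}) \prod_{i=1}^{k} \{I(Z_i > 0) I(y_i = 1) + I(Z_i \le 0) I(y_i = 0)\}\, \phi(Z_i - \boldsymbol{\beta}'\mathbf{x}_i), \tag{8.15}$$

where $\phi$ is the standard normal p.d.f. Even though (8.15) is not a joint density
that can be sampled from directly, the Gibbs sampler can handle it
because $\pi(\boldsymbol{\beta} \mid \mathbf{Z}, \mathbf{y})$ and $\pi(\mathbf{Z} \mid \boldsymbol{\beta}, \mathbf{y})$ are both easy to sample from. It is clear that

$$\pi(\boldsymbol{\beta} \mid \mathbf{Z}, \mathbf{y}) \propto \pi(\boldsymbol{\beta}) \prod_{i=1}^{k} \phi(Z_i - \boldsymbol{\beta}'\mathbf{x}_i), \tag{8.16}$$

whereas

$$Z_i \mid \boldsymbol{\beta}, \mathbf{y} \sim \begin{cases} N(\boldsymbol{\beta}'\mathbf{x}_i, 1) \text{ truncated at the left by } 0 & \text{if } y_i = 1; \\ N(\boldsymbol{\beta}'\mathbf{x}_i, 1) \text{ truncated at the right by } 0 & \text{if } y_i = 0. \end{cases} \tag{8.17}$$

Sampling $\mathbf{Z}$ from (8.17) is straightforward. On the other hand, (8.16) is simply
the posterior density for the regression parameters in the normal linear model
with error variance 1. Therefore, if a flat noninformative prior on $\boldsymbol{\beta}$ is used,
then

$$\boldsymbol{\beta} \mid \mathbf{Z}, \mathbf{y} \sim N_s(\hat{\boldsymbol{\beta}}_Z, (X'X)^{-1}),$$

where $X = (\mathbf{x}_1, \ldots, \mathbf{x}_k)'$ and $\hat{\boldsymbol{\beta}}_Z = (X'X)^{-1}X'\mathbf{Z}$. If a proper normal prior
is used a different normal posterior will emerge. In either case, it is easy to
sample $\boldsymbol{\beta}$ from this posterior conditional on $\mathbf{Z}$.

Extension of this scheme to binomial counts $Y_1, Y_2, \ldots, Y_k$ is straightforward.
We let $Y_i = \sum_{j=1}^{n_i} Y_{ij}$ where $Y_{ij} = I(Z_{ij} > 0)$, with $Z_{ij} \sim N(\boldsymbol{\beta}'\mathbf{x}_i, 1)$
independent, $j = 1, 2, \ldots, n_i$, $i = 1, 2, \ldots, k$. We then proceed exactly as above
but at each Gibbs step we will need to generate $\sum_{i=1}^{k} n_i$ many (truncated)
normals $Z_{ij}$. This procedure is employed in the following example.

Table 8.4. Lethality of Chloracetic Acid

Dose | Fatalities | Dose | Fatalities
...  | ...        | ...  | ...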
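
The two Gibbs steps (8.16) and (8.17) can be sketched as follows for a single covariate plus intercept and a flat prior. This is a minimal sketch under stated assumptions, not the code used for Example 8.4: the dose-fatality counts are hypothetical (they are not the entries of Table 8.4), the function names are ours, and truncated normals are drawn by inverting the normal c.d.f. with Python's `statistics.NormalDist`.

```python
import math
import random
import statistics
from statistics import NormalDist

STD = NormalDist()   # standard normal

def draw_truncated(mean, positive, rng):
    """Draw from N(mean, 1) restricted to (0, inf) if positive, else to (-inf, 0]."""
    p0 = STD.cdf(-mean)                        # P(Z <= 0) when Z ~ N(mean, 1)
    u = rng.random()
    q = p0 + u * (1.0 - p0) if positive else u * p0
    q = min(max(q, 1e-12), 1.0 - 1e-12)        # keep inv_cdf away from 0 and 1
    return mean + STD.inv_cdf(q)

def probit_gibbs(t, y, n_iter, seed=0):
    """Gibbs sampler for Bernoulli y_i with P(y_i = 1) = Phi(b0 + b1 * t_i), flat prior."""
    rng = random.Random(seed)
    s1, st, stt = float(len(t)), sum(t), sum(ti * ti for ti in t)
    det = s1 * stt - st * st
    c00, c01, c11 = stt / det, -st / det, s1 / det   # (X'X)^{-1} for rows (1, t_i)
    l11 = math.sqrt(c00)                             # its Cholesky factor
    l21 = c01 / l11
    l22 = math.sqrt(c11 - l21 * l21)
    b0 = b1 = 0.0
    draws = []
    for _ in range(n_iter):
        # Step (8.17): latent Z_i given beta and y.
        z = [draw_truncated(b0 + b1 * ti, yi == 1, rng) for ti, yi in zip(t, y)]
        # Step (8.16): beta given Z is N(beta_hat_Z, (X'X)^{-1}).
        sz = sum(z)
        stz = sum(ti * zi for ti, zi in zip(t, z))
        m0 = c00 * sz + c01 * stz
        m1 = c01 * sz + c11 * stz
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        b0 = m0 + l11 * e1
        b1 = m1 + l21 * e1 + l22 * e2
        draws.append((b0, b1))
    return draws

# Hypothetical data: 10 subjects per dose, expanded to Bernoulli responses.
doses = [0.1, 0.2, 0.3, 0.4, 0.5]
deaths = [1, 3, 5, 8, 9]
t_obs, y_obs = [], []
for d, f in zip(doses, deaths):
    t_obs += [d] * 10
    y_obs += [1] * f + [0] * (10 - f)

draws = probit_gibbs(t_obs, y_obs, n_iter=1500)
kept = draws[500:]                                   # discard burn-in
mean_b1 = statistics.mean(b1 for _, b1 in kept)
ld50 = statistics.median(-b0 / b1 for b0, b1 in kept)
```

Expanding each binomial count into Bernoulli responses is exactly the extension described above: one latent normal per subject is drawn at every sweep.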


Example 8.4. (Example 8.2 continued). Consider the data given in Table 8.4,
taken from Woodward et al. (1941) where several data sets on toxicity
of certain acids were reported. This particular data set examines the relationship
between exposure to chloracetic acid and the death of mice. At each of
the dose levels (measured in grams per kilogram of body weight), ten mice
were exposed. The median lethal dose LD50 is again one of the quantities of
inferential interest.

The Gibbs sampler as explained above is employed to generate a sample
from the posterior distribution of $(\beta_0, \beta_1)$ given the data. A scatter plot of
1000 points so generated is shown in Figure 8.7. From this sample we have
the estimate of (-1.4426, 5.9224) for $(\beta_0, \beta_1)$ along with standard errors of
0.4451 and 2.0997, respectively.

To estimate the LD50, note that for the probit model LD50 is the dosage
level $t_0$ for which $E(y_i/n_i \mid t_i = t_0) = \Phi(\beta_0 + \beta_1 t_0) = 0.5$. As before, this implies
that LD50 $= -\beta_0/\beta_1$. Using the sample provided by the Gibbs sampler we
estimate this to be 0.248.


Fig. 8.7. Scatter plot of 1000 $(\beta_0, \beta_1)$ values sampled using Gibbs sampler.


8.4 Exercises

1. Show how a random deviate from the Student's $t$ is to be generated.

2. Construct the 95% HPD credible set for $\theta_1 - \theta_2$ for the two-sample problem
in Section 8.1 assuming $\sigma_1^2 = \sigma_2^2$.

3. Show that Student's $t$ can be expressed as scale mixtures of normals.
Using this fact, explain how the 95% HPD credible set for $\theta_1 - \theta_2$ can be
constructed for the case given in (8.3).

4. Consider the data in Table 8.5 from a clinical trial conducted by Mr. S. Sahu,
a medical student at Bangalore Medical College, Bangalore, India (personal
communication). The objective of the study was to compare two
treatments, surgical and non-surgical medical, for short-term management
of benign prostatic hyperplasia (enlargement of prostate). The random
observable of interest is the 'improvement score' recorded for each of the
patients by the physician, which we assume to be normally distributed.
There were 15 patients in the non-surgical group and 14 in the surgical
group.

Table 8.5. Improvement Scores

Surgical     | ...
Non-surgical | 6 8 7 4 4 6 8 3 7 8 9 6 3 6 4

Apply the results from Problems 2 and 3 above to make inferences about
the difference in mean improvement in this problem.

5. Consider the linear regression model (8.4).

(a) Show that $(\hat{\boldsymbol{\beta}}, (\mathbf{y} - \hat{\mathbf{y}})'(\mathbf{y} - \hat{\mathbf{y}}))$ is jointly sufficient for $(\boldsymbol{\beta}, \sigma^2)$.

(b) Show that $\hat{\boldsymbol{\beta}} \mid \sigma^2 \sim N_p(\boldsymbol{\beta}, \sigma^2 (X'X)^{-1})$, $(\mathbf{y} - \hat{\mathbf{y}})'(\mathbf{y} - \hat{\mathbf{y}}) \sim \sigma^2 \chi^2_{n-p}$,
and they are independently distributed.

(c) Under the default prior (8.6), show that $\boldsymbol{\beta} \mid \mathbf{y}$ has the multivariate $t$
distribution having density (8.8).

6. Construct 95% HPD credible set for $(\beta_2, \beta_3)$ in Example 8.1.

7. Consider the model given in (8.12).

(a) What is a sufficient statistic for $\boldsymbol{\beta}$?

(b) Show that the likelihood equations for deriving MLE of $\boldsymbol{\beta}$ must satisfy

$$\sum_{i=1}^{k} \frac{n_i \exp\{-\boldsymbol{\beta}'\mathbf{x}_i\}}{1 + \exp\{-\boldsymbol{\beta}'\mathbf{x}_i\}}\, x_{ij} = \sum_{i=1}^{k} y_i x_{ij}, \quad j = 1, \ldots, s.$$

8. Justify the approximate likelihood given in (8.13).

9. Consider a multivariate normal prior on $\boldsymbol{\beta}$ for Problem 7.

(a) Explain how the M-H algorithm can be used for computing the posterior
quantities.

(b) Compare the above scheme with an importance sampling strategy
where the importance function is proportional to the product of the normal
prior and the approximate normal likelihood given in (8.13).

10. Apply the results from Problem 9 to Example 8.3 with some appropriate
choice for the hyperparameters.

11. Justify (8.16) and (8.17), and explain how Gibbs sampler handles (8.15).

12. Analyze the problem in Example 8.4 with an additional data point having
9 fatalities at the dosage level of 0.3400.

13. Analyze the problem in Example 8.4 using logistic regression. Compare
the results with those obtained in Section 8.3.2 using the probit model.


9 


High-dimensional Problems 


Rather than begin by defining what is meant by high-dimensional, we begin 
with a couple of examples. 


Example 9.1. (Stein's example) Let $N(\boldsymbol{\mu}_{p \times 1}, \Sigma_{p \times p} = \sigma^2 I_{p \times p})$ be a $p$-variate
normal population and $\mathbf{X}_i = (X_{i1}, \ldots, X_{ip})$, $i = 1, \ldots, n$, be $n$ i.i.d. $p$-variate
samples. Because $\Sigma = \sigma^2 I$, we may alternatively think of the data as $p$ independent
samples of size $n$ from $p$ univariate normal populations $N(\mu_j, \sigma^2)$,
$j = 1, \ldots, p$. The parameters of interest are the $\mu_j$'s. For convenience, we initially
assume $\sigma^2$ is known. Usually, the number of parameters, $p$, is large and
the sample size $n$ is small compared with $p$. These have been called problems
with large $p$, small $n$. Note that $n$ in Stein's example is the sample size, if
we think of the data as a $p$-variate sample of size $n$. However, we could also
think of the data as univariate samples of size $n$ from each of $p$ univariate
populations. Then the total sample size would be $np$. The second interpretation
leads to a class of similar examples. Note that the observations are not
exchangeable except in subgroups; in this sense one may call them partially
exchangeable.


Example 9.2. Let $f(x \mid \mu_j)$, $j = 1, \ldots, p$, denote the densities for $p$ populations,
and $X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$, denote $p$ samples of size $n$ from these
$p$ populations. As in Example 9.1, $f(x \mid \mu_j)$ may contain additional common
parameters. The object is to make inference about the $\mu_j$'s.


In several path-breaking papers, Stein (1955), James and Stein (1960),
Stein (1981), Robbins (1955, 1964), and Efron and Morris (1971, 1972, 1973a,
1975) have shown that classical objective Bayes or classical frequentist methods,
e.g., maximum likelihood estimates, will usually be inappropriate here. See
also Kiefer and Wolfowitz (1956) for applications to examples like those of
Neyman and Scott (1948). These approaches are discussed in Sections 9.1
through 9.4, with stress on the parametric empirical Bayes (PEB) approach
of Efron and Morris, as extended in Morris (1983).
It turns out that exchangeability of $\mu_1, \ldots, \mu_p$ plays a fundamental role
in all these approaches. Under this assumption, there is a simple and natural
Bayesian solution of the problem based on a hierarchical prior and MCMC.
Much of the popularity of Bayesian methods is due to the fact that many new
examples of this kind could be treated in a unified way.

Because of the fundamental role of exchangeability of the $\mu_j$'s and the simplicity,
at least in principle, of the Bayesian approach, we begin with these
two topics in Section 9.1. This leads in a natural way to the PEB approach
in Sections 9.2 and 9.3 and the frequentist approach in Section 9.4.

All the above sections deal with point or interval estimation. In Section 9.6 
we deal with testing and multiple testing in high-dimensional problems, with 
an application to microarrays. High-dimensional inference is closely related to 
model selection in high-dimensional problems. A brief overview is presented 
in Sections 9.7 and 9.8. We discuss several general issues in Sections 9.5 and 
9.9. 


9.1 Exchangeability, Hierarchical Priors, Approximation 
to Posterior for Large p, and MCMC 


We follow the notations of Example 9.1 and Example 9.2, i.e., we consider
$p$ similar but not identical populations with densities $f(x \mid \mu_1), \ldots, f(x \mid \mu_p)$.
There is a sample of size $n$, $X_{1j}, \ldots, X_{nj}$, from the $j$th population. These $p$
populations may correspond with $p$ adjacent small areas with unknown per
capita income $\mu_1, \ldots, \mu_p$, as in small area estimation, Ghosh and Meeden
(1997, Chapters 4, 5). They could also correspond with $p$ clinical trials in a
particular hospital, where $\mu_j$, $j = 1, \ldots, p$, are the mean effects of the drug being
tested. In all these examples, the different studied populations are related
to each other. In Morris (1983), which we closely follow in Section 9.2, the
$p$ populations correspond to $p$ baseball players and the $\mu_j$'s are average scores.
Other such studies are listed in Morris and Christiansen (1996).

In order to assign a prior distribution for the $\mu_j$'s, we model them as exchangeable
rather than i.i.d. or just independent. An exchangeable, dependent
structure is consistent with the assumption that the studies are similar in a
broad sense, so they share many common elements.

On the other hand, independence may be unnecessarily restrictive and
somewhat unintuitive in the sense that one would expect to have separate
analysis for each sample if the $\mu_j$'s were independent and hence unrelated.
However, to justify exchangeability one would need a particular kind of dependence.
For example, Morris (1983) points out that the baseball players in
his study were all hitters; his analysis would have been hard to justify if he
had considered both hitters and pitchers.

Using a standard way of generating exchangeable random variables, we introduce
a vector of hyperparameters $\boldsymbol{\eta}$ and assume the $\mu_j$'s are i.i.d. $\pi(\mu \mid \boldsymbol{\eta})$ given
$\boldsymbol{\eta}$. Typically, if $f(x \mid \mu)$ belongs to an exponential family, it is convenient to
take $\pi(\mu \mid \boldsymbol{\eta})$ to be a conjugate prior. It can be shown that for even moderately
large $p$ (in the baseball example of Morris (1983), $p = 18$) there is a lot of
information in the data on $\boldsymbol{\eta}$. Hence the choice of a prior for $\boldsymbol{\eta}$ does not have
much influence on inference about the $\mu_j$'s. It is customary to choose a uniform
or one of the other objective priors (vide Chapter 5) for $\boldsymbol{\eta}$.

We illustrate these ideas by exploring in detail Example 9.1.


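Example 9.3. (Example 9.1, continued) Let $f(x \mid \mu_j)$ be the density of $N(\mu_j, \sigma^2)$
where we initially assume $\sigma^2$ is known. We relax this assumption in Subsection
9.1.1.

The prior for $\mu_j$ is taken to be $N(\eta_1, \eta_2)$ where $\eta_1$ is the prior guess about
the $\mu_j$'s and $\eta_2$ is a measure of the prior uncertainty about this guess, vide
Example 2.1, Chapter 2. The prior for $\eta_1, \eta_2$ is $\pi(\eta_1, \eta_2)$, which we specify a
bit later.

Because $(\bar{X}_j = \sum_i X_{ij}/n,\ j = 1, \ldots, p)$ form a sufficient statistic and the
$\bar{X}_j$'s are independent, the Bayes estimate for $\mu_j$ under squared error loss is

$$E(\mu_j \mid \mathbf{X}) = E(\mu_j \mid \bar{\mathbf{X}}) = \int E(\mu_j \mid \bar{\mathbf{X}}, \boldsymbol{\eta})\, \pi(\boldsymbol{\eta} \mid \bar{\mathbf{X}})\, d\boldsymbol{\eta}$$

where $\mathbf{X} = (X_{ij},\ i = 1, \ldots, n,\ j = 1, \ldots, p)$, $\bar{\mathbf{X}} = (\bar{X}_1, \ldots, \bar{X}_p)$ and (vide
Example 2.1)

$$E(\mu_j \mid \bar{\mathbf{X}}, \boldsymbol{\eta}) = E(\mu_j \mid \bar{X}_j, \boldsymbol{\eta}) = \frac{\eta_2 \bar{X}_j + (\sigma^2/n)\eta_1}{\eta_2 + \sigma^2/n} = (1 - B)\bar{X}_j + B\eta_1, \tag{9.1}$$

with $B = (\sigma^2/n)/(\eta_2 + \sigma^2/n)$, depends on data only through $\bar{X}_j$.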

The Bayes estimate of $\mu_j$, on the other hand, depends on $\bar{X}_j$ as above
and also on $(\bar{X}_1, \ldots, \bar{X}_p)$ because the posterior distribution of $\boldsymbol{\eta}$ depends on
all the $\bar{X}_j$'s. Thus the Bayes estimate learns from the full sufficient statistic,
justifying simultaneous estimation of all the $\mu_j$'s. This learning process is
sometimes referred to as borrowing strength. If the modeling of the $\mu_j$'s is realistic,
we would expect the Bayes estimates to perform better than the $\bar{X}_j$'s. This
is what is strikingly new in the case of large $p$, small $n$ and follows from the
exchangeability of the $\mu_j$'s.

The posterior density $\pi(\boldsymbol{\eta} \mid \bar{\mathbf{X}})$ is also easy to calculate in principle. For
known $\sigma^2$, one can get it explicitly.

Integrating out the $\mu_j$'s and holding $\boldsymbol{\eta}$ fixed, we get that the $\bar{X}_j$'s are independent
and

$$\bar{X}_j \mid \boldsymbol{\eta} \sim N(\eta_1, \eta_2 + \sigma^2/n). \tag{9.2}$$

Let the prior density of $(\eta_1, \eta_2)$ be constant on $\mathbb{R} \times \mathbb{R}^{+}$. (See Problem 1 and
Gelman et al. (1995) to find out why some other choices like $\pi(\eta_1, \eta_2) = 1/\eta_2$
are not suitable here.)

Given (9.2) and $\pi(\eta_1, \eta_2)$ as above,


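
$$\begin{aligned}
\pi(\eta_1, \eta_2 \mid \bar{\mathbf{X}}) &\propto \{2\pi(\eta_2 + \sigma^2/n)\}^{-p/2} \exp\Big\{-\frac{1}{2(\eta_2 + \sigma^2/n)} \sum_{j=1}^{p} (\bar{X}_j - \eta_1)^2\Big\} \\
&\propto (\eta_2 + \sigma^2/n)^{-1/2} \exp\Big\{-\frac{p}{2(\eta_2 + \sigma^2/n)} (\eta_1 - \bar{X})^2\Big\} \\
&\quad \times (\eta_2 + \sigma^2/n)^{-(p-1)/2} \exp\Big\{-\frac{S}{2(\eta_2 + \sigma^2/n)}\Big\}
\end{aligned} \tag{9.3}$$

where $\bar{X} = \frac{1}{p}\sum_{j=1}^{p} \bar{X}_j$ and $S = \sum_{j=1}^{p} (\bar{X}_j - \bar{X})^2$.
In a similar way,

$$\pi(\boldsymbol{\mu} \mid \bar{\mathbf{X}}) = \int \prod_{j=1}^{p} \pi(\mu_j \mid \bar{X}_j, \boldsymbol{\eta})\, \pi(\boldsymbol{\eta} \mid \bar{\mathbf{X}})\, d\boldsymbol{\eta} \tag{9.4}$$

where the $(\mu_j \mid \bar{X}_j, \boldsymbol{\eta})$ are independent normal with

$$\text{mean as in (9.1) and variance } \frac{\eta_2 (\sigma^2/n)}{\eta_2 + \sigma^2/n} \tag{9.5}$$

and

$$\pi(\boldsymbol{\eta} \mid \bar{\mathbf{X}}) = \pi(\eta_1 \mid \bar{\mathbf{X}}, \eta_2)\, \pi(\eta_2 \mid \bar{\mathbf{X}}) \tag{9.6}$$

is the product of a normal and inverse-Gamma (as given in (9.3)).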
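
For known $\sigma^2$ the Bayes estimate can be computed by one-dimensional numerical integration, as the following sketch illustrates. It rests on a short derivation of our own from (9.1) and (9.3), which is not spelled out in the text: since $E(\eta_1 \mid \eta_2, \bar{\mathbf{X}}) = \bar{X}$, we get $E(\mu_j \mid \bar{\mathbf{X}}) = (1 - E(B \mid \bar{\mathbf{X}}))\bar{X}_j + E(B \mid \bar{\mathbf{X}})\bar{X}$, and changing variables from $\eta_2$ to $B$ gives a posterior density for $B$ proportional to $B^{(p-5)/2}\exp\{-SB/(2\sigma^2/n)\}$ on $(0, 1)$, which requires $p \ge 4$. The data and function name are hypothetical.

```python
import math

def hb_posterior_means(xbar, sigma2_over_n, grid=20000):
    """E(mu_j | data) under the flat prior on (eta1, eta2), via a midpoint rule over B."""
    p = len(xbar)
    if p < 4:
        raise ValueError("the posterior of B is integrable near 0 only for p >= 4")
    grand = sum(xbar) / p
    S = sum((x - grand) ** 2 for x in xbar)
    c = sigma2_over_n
    num = den = 0.0
    for i in range(grid):
        b = (i + 0.5) / grid                       # midpoint of the i-th cell of (0, 1)
        weight = b ** ((p - 5) / 2.0) * math.exp(-S * b / (2.0 * c))
        num += b * weight
        den += weight
    e_b = num / den                                # posterior mean of B
    return [(1 - e_b) * x + e_b * grand for x in xbar], e_b

# Hypothetical sample means for p = 8 populations, with sigma^2 / n = 0.5.
xbar = [2.1, 3.4, 1.8, 4.0, 2.9, 3.3, 2.5, 3.8]
estimates, e_b = hb_posterior_means(xbar, 0.5)
```

Every $\bar{X}_j$ is pulled toward the grand mean by the same factor $E(B \mid \bar{\mathbf{X}})$, which depends on all the sample means through $S$; this is the borrowing of strength described above.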
Construction of a credible interval for $\mu_j$ is, in principle, simple. Consider
$\pi(\mu_j \mid \mathbf{X})$ and fix $0 < \alpha < 1$. Calculate the posterior quantiles $\underline{\mu}_j(\mathbf{X})$, $\bar{\mu}_j(\mathbf{X})$
of orders $100\alpha/2$ and $100(1 - \alpha/2)$ for given data. Then

$$P\{\underline{\mu}_j(\mathbf{X}) \le \mu_j \le \bar{\mu}_j(\mathbf{X}) \mid \mathbf{X}\} = 1 - \alpha.$$

In general, to calculate $\underline{\mu}_j$ and $\bar{\mu}_j$ one would have to resort to MCMC,
which is explained in Subsection 9.1.1.

For large $p$, good approximations are available. To do this we anticipate
to some extent Section 9.2.

Because $p$ is large, we can invoke the theorem on posterior normality
(Chapter 4). Thus the posterior for $\boldsymbol{\eta}$ is nearly normal with mean $\hat{\boldsymbol{\eta}}$ and
variances of order $O(1/p)$, $\hat{\boldsymbol{\eta}}$ being the MLE of $\boldsymbol{\eta}$ based on the "likelihood"

$$\prod_{j=1}^{p} f(\bar{x}_j \mid \boldsymbol{\eta}).$$

Hence, $\pi(\boldsymbol{\eta} \mid \bar{\mathbf{x}})$ is approximately (in the sense of convergence in distribution)
degenerate at $\hat{\boldsymbol{\eta}}$. This implies

$$\pi(\mu_j \mid \mathbf{X}) = \int \pi(\mu_j \mid \bar{X}_j, \boldsymbol{\eta})\, \pi(\boldsymbol{\eta} \mid \bar{\mathbf{X}})\, d\boldsymbol{\eta} = \pi(\mu_j \mid \bar{X}_j, \hat{\boldsymbol{\eta}}) \quad \text{(approximately)}. \tag{9.7}$$

This in turn implies the Bayes estimate of $\mu_j$ is

$$E(\mu_j \mid \mathbf{X}) = E(\mu_j \mid \bar{X}_j, \hat{\boldsymbol{\eta}}) \quad \text{(approximately)} \tag{9.8}$$

which, by (9.1), is a shrinkage estimate that shrinks $\bar{X}_j$ towards $\hat{\eta}_1$, and which
depends on all the sample means.

The approximation (9.8) is correct up to $O(1/p)$. A similar argument provides
an approximation to the posterior s.d. but the accuracy is only $O(1/\sqrt{p})$.

Simulations indicate the approximation for the Bayes estimate is quite
good but that for the posterior s.d. is much less accurate. It is known, vide
Morris (1983), that the approximation is also inadequate for credible intervals.

As a final application of this approximation we indicate it is possible to
check whether the prior $\pi(\mu_j \mid \boldsymbol{\eta})$ is consistent with data. More precisely, we
check the consistency of $f(\bar{x}_j \mid \boldsymbol{\eta})$ with data, but a check for $f(\bar{x}_j \mid \boldsymbol{\eta})$ is indirectly
a check for $\pi(\mu_j \mid \boldsymbol{\eta})$.

In the light of the data, $\boldsymbol{\eta} = \hat{\boldsymbol{\eta}}$ is the most likely value of the hyperparameter
$\boldsymbol{\eta}$. Under $\hat{\boldsymbol{\eta}}$, the $\bar{X}_j$'s are i.i.d. normal with mean and variance given by (9.2)
with $\boldsymbol{\eta}$ replaced by $\hat{\boldsymbol{\eta}}$. We can now check the fit of this model to the empirical
distribution via the Q-Q plot. For each $0 < q < 1$, we plot the $100q$th quantiles
for the theoretical and empirical distributions as $(x(q), y(q))$. If the fit is
good, the resulting curve $\{(x(q), y(q)),\ 0 < q < 1\}$ should scatter around an
equiangular line passing through the origin.
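
The coordinates of such a Q-Q plot might be computed as in the sketch below. It assumes the normal model (9.2), under which the MLE is $\hat{\eta}_1 = \bar{X}$ and $\hat{\eta}_2 = \max(0, S/p - \sigma^2/n)$. The plotting positions $(j + 0.5)/p$ are one common convention and are our choice, as are the data and the function name.

```python
from statistics import NormalDist

def qq_points(xbar, sigma2_over_n):
    """Pairs (theoretical quantile, empirical quantile) under the fitted model (9.2)."""
    p = len(xbar)
    eta1_hat = sum(xbar) / p
    S = sum((x - eta1_hat) ** 2 for x in xbar)
    eta2_hat = max(0.0, S / p - sigma2_over_n)     # MLE of eta2, truncated at zero
    model = NormalDist(eta1_hat, (eta2_hat + sigma2_over_n) ** 0.5)
    ordered = sorted(xbar)
    return [(model.inv_cdf((j + 0.5) / p), x) for j, x in enumerate(ordered)]

# Hypothetical sample means, with sigma^2 / n = 0.5.
xbar = [2.1, 3.4, 1.8, 4.0, 2.9, 3.3, 2.5, 3.8]
points = qq_points(xbar, 0.5)
```

Plotting the pairs in `points` and comparing them with the line $y = x$ gives the visual check described above.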


9.1.1 MCMC and E-M Algorithm 


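We begin with $p$ exponential densities of the same form, namely,

$$f(x \mid \boldsymbol{\theta}_j) = \exp\Big\{c(\boldsymbol{\theta}_j) + \sum_{i=1}^{k} t_i(x)\theta_{ji}\Big\} h(x), \quad j = 1, \ldots, p. \tag{9.9}$$

The conjugate prior density for the $j$th model is proportional to

$$\exp\Big\{\eta_0 c(\boldsymbol{\theta}_j) + \sum_{i=1}^{k} \eta_i \theta_{ji}\Big\}, \quad j = 1, \ldots, p. \tag{9.10}$$

Note that the hyperparameters $(\eta_0, \eta_1, \ldots, \eta_k)$ are the same for all $j$. Finally,
consider a prior $\pi(\boldsymbol{\eta})$ for the hyperparameters.

Let $\mathbf{X} = (X_1, \ldots, X_p)$ and $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_p)$. The conditional density of
$\boldsymbol{\theta}$ given $\mathbf{X}, \boldsymbol{\eta}$ is

$$\pi(\boldsymbol{\theta} \mid \mathbf{X}, \boldsymbol{\eta}) \propto \prod_{j=1}^{p} \exp\Big\{(\eta_0 + 1) c(\boldsymbol{\theta}_j) + \sum_{i=1}^{k} (t_i(X_j) + \eta_i)\theta_{ji}\Big\} \tag{9.11}$$

which shows that, conditionally, the $\boldsymbol{\theta}_j$'s remain independent. Also

$$\pi(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\theta}) \propto \exp\Big\{p\, d(\boldsymbol{\eta}) + (\eta_0 + 1) \sum_{j=1}^{p} c(\boldsymbol{\theta}_j) + \sum_{j=1}^{p} \sum_{i=1}^{k} (\eta_i + t_i(X_j))\theta_{ji}\Big\} \pi(\boldsymbol{\eta}) \tag{9.12}$$

where $\exp(d(\boldsymbol{\eta}))$ is the normalizing constant of the expression in (9.10).
By (9.12), the Bayes formula and cancellation of some common terms in
the numerator and denominator of the Bayes formula,

$$\pi(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\theta}) \propto \exp\Big\{p\, d(\boldsymbol{\eta}) + \eta_0 \sum_{j=1}^{p} c(\boldsymbol{\theta}_j) + \sum_{j=1}^{p} \sum_{i=1}^{k} \eta_i \theta_{ji}\Big\} \pi(\boldsymbol{\eta}).$$

If $d(\boldsymbol{\eta})$ has an explicit form, as is often the case, one can apply Gibbs sampling
to draw samples from the joint posterior of $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$ using the conditionals
$\pi(\boldsymbol{\theta} \mid \mathbf{X}, \boldsymbol{\eta})$ and $\pi(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\theta})$. Otherwise one can apply Metropolis-Hastings.

In general, the approximations based on $\hat{\boldsymbol{\eta}}$ are still valid but computation
of $\hat{\boldsymbol{\eta}}$ is non-trivial. It turns out that the E-M algorithm can be applied, vide
Gelman et al. (1995, Chapter 9).

We illustrate the algorithms for MCMC and E-M in the case of $N(\mu_j, \sigma^2)$,
$j = 1, \ldots, p$, with $(\mu_1, \ldots, \mu_p)$ and $\sigma^2$ unknown. MCMC is quite straightforward
here. Recall Example 7.13 from Chapter 7. The hierarchical Bayesian
analysis of the usual one-way ANOVA was discussed there. With minimal
modification, the same algorithm fits here to allow Gibbs sampling. Application
of the E-M algorithm is also easy, as discussed in Gelman et al.
(1995). We assume as before that the $\mu_j$ are i.i.d. $N(\eta_1, \eta_2)$, and further take
$\pi(\eta_1, \sigma^2, \eta_2) = 1/\sigma^2$. Then, recall from Section 7.2 that we need to apply the
E and M steps to

$$\log \pi(\boldsymbol{\mu}, \eta_1, \sigma^2, \eta_2 \mid \mathbf{X}) = -\Big(\frac{np}{2} + 1\Big) \log \sigma^2 - \frac{p}{2} \log \eta_2 - \frac{1}{2\eta_2} \sum_{j=1}^{p} (\mu_j - \eta_1)^2 - \frac{1}{2\sigma^2} \sum_{j=1}^{p} \sum_{i=1}^{n} (X_{ij} - \mu_j)^2 + \text{constants}. \tag{9.13}$$

The E-step requires the conditional distribution of $(\boldsymbol{\mu}, \sigma^2)$ given $\mathbf{X}$ and the
current value $(\eta_1^{(t)}, \eta_2^{(t)})$ of $(\eta_1, \eta_2)$. This is just the normal, inverse Gamma
model. In the M-step we need to maximize $E^{(t)}(\log \pi(\boldsymbol{\mu}, \eta_1, \sigma^2, \eta_2 \mid \mathbf{X}))$ as a
function of $(\eta_1, \eta_2)$, which is straightforward.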


9.2 Parametric Empirical Bayes 


To explain the basic ideas, we consider once more the special case of $N(\mu_j, \sigma^2)$,
$\sigma^2$ known. Explicit formulas are available in this special case for comparison
with the estimates of Stein. Another interesting special case is discussed in
Carlin and Louis (1996, Chapter 3).

The PEB approach was introduced by Efron and Morris in a series of
papers including Efron and Morris (1971, 1972, 1973a, 1973b, 1975, 1976).
In this section we generally follow Morris (1983).

Efron and Morris tried to take an intermediate position between a fully
Bayes and a fully frequentist approach by treating the likelihood as given by
$f(\bar{x}_j \mid \boldsymbol{\eta})$ obtained by integrating out the $\mu_j$'s as in (9.2). The $\eta$'s are treated
as unknown parameters as in frequentist analysis whereas the $\mu_j$'s are treated
as random variables as in Bayesian analysis. This leads to a reduction of a
high-dimensional frequentist problem about the $\mu_j$'s to a low-dimensional
semi-frequentist problem about $\boldsymbol{\eta}$, about which there is a lot of information in the
data. The fully Bayesian and the PEB approach differ in that no prior is assigned
to $\boldsymbol{\eta}$ in PEB, and $\boldsymbol{\eta}$ is estimated by MLE or by a suitable unbiased estimate.
So one may, if one wishes, think of the PEB approach as an approximation
to the hierarchical Bayes approach of Section 9.1. A disadvantage of PEB is
that accounting for the uncertainty about $\boldsymbol{\eta}$ is more difficult than in hierarchical
Bayes, a point that will be discussed again in Subsection 9.2.1. An
advantage is that one gets an explicit estimate for $\mu_j$, namely, (9.1) with $\boldsymbol{\eta}$
replaced by an estimate of $\boldsymbol{\eta}$.

Note that under the likelihood (9.2), the complete sufficient statistic is the
pair

$$\Big(\bar{X} = \frac{1}{p} \sum_{j=1}^{p} \bar{X}_j, \quad S = \sum_{j=1}^{p} (\bar{X}_j - \bar{X})^2\Big). \tag{9.14}$$

Also, $\bar{X}$ and

$$\hat{B} = \frac{(p - 3)\sigma^2/n}{S} \tag{9.15}$$

are unbiased estimates of $\eta_1$ and

$$B = \frac{\sigma^2/n}{\eta_2 + \sigma^2/n}. \tag{9.16}$$

Then the best unbiased predictor of the RHS of (9.1) is

$$\hat{\mu}_j = (1 - \hat{B})\bar{X}_j + \hat{B}\bar{X} \tag{9.17}$$

which is the famous James-Stein-Lindley estimate of $\mu_j$. It shrinks $\bar{X}_j$ towards
the overall mean $\bar{X}$.

The amount of shrinkage is determined by $\hat{B}$, which is close to 1 if $S/(p - 3)$
is close to $\sigma^2/n$ and close to zero if $S/(p - 3)$ is much larger than $\sigma^2/n$. Note
that if $S/(p - 3)$ is small, as in the first case, then the $\bar{X}_j$'s are close to $\bar{X}$,
indicating that the $\mu_j$'s are close to each other. This justifies a fairly strong shrinkage
towards the grand mean. On the other hand, a large $S/(p - 3)$ indicates
heterogeneity among the $\mu_j$'s, suggesting relatively large weight for $\bar{X}_j$.

Because

$$E(S/(p - 1)) = \frac{\sigma^2}{n} + \eta_2, \tag{9.18}$$

an unbiased estimate of $\eta_2$ is $\hat{\eta}_2 = S/(p - 1) - \sigma^2/n$. Because $\eta_2 \ge 0$, a
more plausible estimate is $\hat{\eta}_2^{+} = \max(0, \hat{\eta}_2)$, the positive part of the unbiased
estimate. This suggests changing the estimate of $B$ to a modified estimate
$\hat{B}^{+}$, built from $\hat{\eta}_2^{+}$ in place of $\hat{\eta}_2$, (9.19),
which is the James-Stein-Lindley positive part estimate.
If we take $\eta_1 = 0$, i.e., the $\mu_j$'s are i.i.d. $N(0, \eta_2)$, then the two estimates are
of the form

$$\hat{\mu}_j = (1 - \hat{B})\bar{X}_j \quad \text{and} \quad \hat{\mu}_j^{+} = (1 - \hat{B}^{+})\bar{X}_j. \tag{9.20}$$

These are the James-Stein and James-Stein positive part estimates. They
shrink the estimate towards an arbitrary point zero and so do not seem attractive
in the exchangeable case. But they have turned out to be quite useful
in estimating coefficients in an orthogonal expansion of an unknown function
with white noise as error, vide Cai et al. (2000). We study frequentist
properties of these two estimates in Section 9.4.
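
A minimal sketch of the James-Stein-Lindley estimate (9.17) follows, with an optional positive-part variant. The truncation $\min(1, \hat{B})$ used for the variant is one common positive-part form and is an assumption on our part; it need not coincide with the exact constant in (9.19). The data and function name are hypothetical.

```python
def jsl_estimates(xbar, sigma2_over_n, positive_part=False):
    """Shrink each sample mean toward the grand mean, as in (9.17)."""
    p = len(xbar)
    if p < 4:
        raise ValueError("need p >= 4 so that p - 3 > 0")
    grand = sum(xbar) / p
    S = sum((x - grand) ** 2 for x in xbar)
    if S == 0:
        raise ValueError("all sample means are equal; B-hat is undefined")
    b_hat = (p - 3) * sigma2_over_n / S            # equation (9.15)
    if positive_part:
        b_hat = min(1.0, b_hat)                    # assumed truncation, see lead-in
    return [(1 - b_hat) * x + b_hat * grand for x in xbar], b_hat

# Hypothetical sample means with sigma^2 / n = 1: S = 10, so B-hat = 2 * 1 / 10 = 0.2.
est, b_hat = jsl_estimates([1.0, 2.0, 3.0, 4.0, 5.0], 1.0)
# With a much larger sigma^2 / n the untruncated B-hat would exceed 1.
est_pos, b_pos = jsl_estimates([1.0, 2.0, 3.0, 4.0, 5.0], 10.0, positive_part=True)
```

Without the truncation, a value of $\hat{B}$ above 1 would push each estimate past the grand mean to the opposite side, which is the behavior the positive-part version is meant to prevent.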


9.2.1 PEB and HB Interval Estimates 


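Morris defines a random confidence interval $(\underline{\mu}_j(\mathbf{X}), \bar{\mu}_j(\mathbf{X}))$ for $\mu_j$ to have
PEB confidence coefficient $(1 - \alpha)$ if

$$P_{\boldsymbol{\eta}}\{\underline{\mu}_j \le \mu_j \le \bar{\mu}_j\} \ge 1 - \alpha. \tag{9.21}$$

Let $s_j^2 = [1 - ((p - 1)/p)\hat{B}]\sigma^2/n + (2/(p - 3))\hat{B}^2(\bar{X}_j - \bar{X})^2$. Morris has conjectured that

$$\hat{\mu}_j \pm z_{\alpha/2}\, s_j \tag{9.22}$$

is a PEB confidence interval with confidence coefficient $1 - \alpha$.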
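
To make the formula for $s_j^2$ concrete, the sketch below computes intervals of the form $\hat{\mu}_j \pm z_{\alpha/2} s_j$, with $\hat{\mu}_j$ from (9.17) and $\hat{B}$ from (9.15). Reading (9.22) as this symmetric normal-quantile interval is an assumption on our part, and the data and function name are hypothetical.

```python
from statistics import NormalDist

def morris_intervals(xbar, sigma2_over_n, alpha=0.05):
    """Intervals mu_hat_j -/+ z_{alpha/2} * s_j, with s_j^2 as defined in the text."""
    p = len(xbar)
    grand = sum(xbar) / p
    S = sum((x - grand) ** 2 for x in xbar)
    b_hat = (p - 3) * sigma2_over_n / S
    z = NormalDist().inv_cdf(1 - alpha / 2)
    out = []
    for x in xbar:
        mu_hat = (1 - b_hat) * x + b_hat * grand
        s2 = (1 - (p - 1) / p * b_hat) * sigma2_over_n \
            + 2.0 / (p - 3) * b_hat ** 2 * (x - grand) ** 2
        half = z * s2 ** 0.5
        out.append((mu_hat - half, mu_hat + half))
    return out

# Hypothetical sample means with sigma^2 / n = 1 (so B-hat = 0.2 and s_1^2 = 0.84 + 0.16 = 1).
intervals = morris_intervals([1.0, 2.0, 3.0, 4.0, 5.0], 1.0)
```

The second term of $s_j^2$ widens the interval for populations whose sample mean lies far from the grand mean, reflecting the extra uncertainty from estimating $B$.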

It is shown in Basu et al. (2003) that the conjecture is not true but the 
violations are so rare and so small in magnitude that it hardly matters. Basu 
et al. (2003) suggest an adjustment that would make (9.22) true up to O(p~’). 
It is also shown there that asymptotically the adjusted interval is equivalent 
to a PEB interval proposed by Carlin and Louis (1996, Chapter 3). 

A trouble with Morris's interval is that it is somewhat ad hoc. We are
not told how exactly it is derived. It seems he puts a noninformative prior
on $\eta_1, \eta_2$ and adjusts somewhat the HB credible interval to get a conservative
frequentist coverage probability.

There is a natural alternative that does not require additional adjustment
and ensures (9.21) with the inequality replaced by an equality up to $O(p^{-2})$.
To do this, one has to choose a prior for $\boldsymbol{\eta}$ that is probability matching in the
sense of

$$P_{\boldsymbol{\eta}}\{\underline{\mu}_j \le \mu_j \le \bar{\mu}_j\} = 1 - \alpha + O(p^{-2}), \tag{9.23}$$

where

$$P\{\mu_j > \bar{\mu}_j \mid \mathbf{X}\} = \alpha/2, \qquad P\{\mu_j < \underline{\mu}_j \mid \mathbf{X}\} = \alpha/2, \tag{9.24}$$

and the probabilities in (9.24) are the posterior probabilities under the prior
for $\boldsymbol{\eta}$. This leads to probability matching differential equations. A solution is


o7/n 


rer (9.25) 


t(n) = 


vide Datta, Ghosh, and Mukerjee (2000). 


9.3 Linear Models for High-dimensional Parameters 


We can extend the HB and PEB approach to a more general setup using
covariates and linear models. The parameters are no longer exchangeable but
are affected by a common set of low-dimensional hyperparameters assuming
the role of $\boldsymbol{\eta}$. The model in Sections 9.1 and 9.2 is a special case of the linear
model below.

Following Morris (1983), we change our notations slightly:

$$Y_j \mid \theta_j \sim N(\theta_j, V), \quad j = 1, \ldots, p \text{ independent}, \tag{9.26}$$

and given $\boldsymbol{\beta}, A$,

$$\boldsymbol{\theta}_{p \times 1} = Z_{p \times r}\, \boldsymbol{\beta}_{r \times 1} + \boldsymbol{\epsilon}_{p \times 1}, \tag{9.27}$$

where the $\epsilon_j$'s are i.i.d. $N(0, A)$. Here $p$ is at least moderately large, $r$ is relatively small.
Morris allows the variance of $\epsilon_j$ to depend on $j$, which is often a more realistic
assumption. Keeping the same variance $A$ for all $j$ simplifies the algebra
considerably.
In the HB analysis we need to put a further prior on $\boldsymbol{\beta}$. A conjugate prior
for $\boldsymbol{\beta}$ given $A$ is

$$\boldsymbol{\beta} \sim N(\boldsymbol{\eta}_1, \eta_2 (Z'Z)^{-1}). \tag{9.28}$$

Finally, $A$ is given an inverse Gamma or a uniform or the standard prior
$1/A$ for scale parameters. Assuming $V$ is known, our specification of priors is
complete.

To do MCMC we partition the parameters and (random) hyperparameters
into three sets $(\boldsymbol{\theta}, \boldsymbol{\beta}, A)$. The conditionals are as follows.
(1) Given $\boldsymbol{\beta}, A$ (and $\mathbf{Y}$), $\boldsymbol{\theta}$ is multivariate normal.
(2) Given $\boldsymbol{\theta}, A$ (and $\mathbf{Y}$), $\boldsymbol{\beta}$ is multivariate normal.
(3) Given $\boldsymbol{\theta}, \boldsymbol{\beta}$ (and $\mathbf{Y}$), $A$ has an inverse Gamma distribution.
You are asked to find the parameters of these conditionals in Problem 6.

In the PEB approach, one first notes

$$\theta_i \mid Y_i, \boldsymbol{\beta}, A \sim N(\theta_i^{*}, V(1 - B)), \tag{9.29}$$

where

$$\theta_i^{*} = (1 - B)Y_i + B Z_i'\boldsymbol{\beta} \tag{9.30}$$

with $B = V/(V + A)$. Here $Z_i'$ is the $i$th row of $Z$. This is the
shrinkage estimate corresponding with (9.1) of Section 9.1.

In the PEB approach one has to estimate $\boldsymbol{\beta}$ and $B$ either by maximizing
the likelihood of the independent $Y_i$'s with

$$Y_i \mid \boldsymbol{\beta}, A \sim N(Z_i'\boldsymbol{\beta}, V + A) \tag{9.31}$$

or by finding a suitable unbiased estimate as in (9.18). Let

$$\hat{\boldsymbol{\beta}} = (Z'Z)^{-1}(Z'\mathbf{Y}).$$

The statistics $\hat{\boldsymbol{\beta}}$ and

$$S = (\mathbf{Y} - Z\hat{\boldsymbol{\beta}})'(\mathbf{Y} - Z\hat{\boldsymbol{\beta}})$$

are independent, complete sufficient statistics for $(\boldsymbol{\beta}, A)$. Hence the best unbiased
estimates for $\boldsymbol{\beta}$ and $B$ are $\hat{\boldsymbol{\beta}}$ and

$$\hat{B} = (p - r - 2)V/S$$

(vide Problem 10). Substituting in the shrinkage estimate $\theta_i^{*}$ for $\theta_i$, one gets

$$\hat{\theta}_i = (1 - \hat{B})Y_i + \hat{B} Z_i'\hat{\boldsymbol{\beta}}.$$

This is the analogue of the James-Stein-Lindley estimate under the regression
model.

In Problem 8, you are asked to show that the PEB risk of $\hat{\theta}_i$, namely
$E_{\boldsymbol{\beta}, A}(\hat{\theta}_i - \theta_i)^2$, is smaller than the PEB risk of $Y_i$, namely, $E_{\boldsymbol{\beta}, A}(Y_i - \theta_i)^2$. The
relative strength of the PEB estimate comes through the use of $\hat{B}$, which is
based on the full data set $\mathbf{Y}$.
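
The regression version of the shrinkage estimate can be sketched for the simplest case $r = 2$, where each row of $Z$ is $(1, x_i)$ for a single hypothetical covariate $x_i$. The data, the value of $V$ and the function name are illustrative assumptions; $\hat{B}$ is used without truncation, exactly as in the formula above.

```python
def peb_regression(y, x, V):
    """PEB shrinkage of each Y_i toward a fitted line; Z has rows (1, x_i), so r = 2."""
    p, r = len(y), 2
    if p <= r + 2:
        raise ValueError("need p > r + 2 for a positive B-hat")
    xm, ym = sum(x) / p, sum(y) / p
    sxx = sum((xi - xm) ** 2 for xi in x)
    slope = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
    intercept = ym - slope * xm                      # least squares beta-hat
    fitted = [intercept + slope * xi for xi in x]    # Z_i' beta-hat
    S = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    b_hat = (p - r - 2) * V / S                      # unbiased estimate of B = V / (V + A)
    theta_hat = [(1 - b_hat) * yi + b_hat * fi for yi, fi in zip(y, fitted)]
    return theta_hat, b_hat

# Hypothetical observations and covariate values, with known V = 0.01.
y_obs = [1.0, 2.1, 2.9, 4.2, 4.8, 6.1]
x_obs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
theta_hat, b_hat = peb_regression(y_obs, x_obs, V=0.01)
```

When the residual sum of squares $S$ is small relative to $V$, $\hat{B}$ is large and the estimates move close to the fitted line; when $S$ is large they stay close to the raw $Y_i$.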

In Section 8.3, linear regression is discussed as a common statistical prob- 
lem where an objective Bayesian analysis is done. You may wish to explore 
how similar ideas are used in this section to model a high-dimensional prob- 
lem. 


9.4 Stein’s Frequentist Approach to a High-dimensional 
Problem 


Once again we study Example 9.1. Let the $\bar{X}_j$'s be independent, $\bar{X}_j \sim N(\mu_j, \sigma^2/n)$.
Classical criteria like maximum likelihood, minimaxity or minimizing variance
among unbiased estimates all lead to $(\bar{X}_1, \ldots, \bar{X}_p)$ as estimate of $(\mu_1, \ldots, \mu_p)$.
Let $p \ge 3$. Stein, in a series of papers, Stein (1956), James and Stein (1960),
Stein (1981), showed that if we define a loss function

$$L(\bar{\mathbf{X}}, \boldsymbol{\mu}) = \sum_{j=1}^{p} (\bar{X}_j - \mu_j)^2 \tag{9.32}$$

and generally

$$L(\mathbf{T}, \boldsymbol{\mu}) = \sum_{j=1}^{p} (T_j(\bar{\mathbf{X}}) - \mu_j)^2 \tag{9.33}$$

for a general estimate $\mathbf{T}$, it is possible to choose a $\mathbf{T}$ that is better than $\bar{\mathbf{X}}$ in
the sense

$$E_{\boldsymbol{\mu}}(L(\mathbf{T}, \boldsymbol{\mu})) < E_{\boldsymbol{\mu}}(L(\bar{\mathbf{X}}, \boldsymbol{\mu})) \quad \text{for all } \boldsymbol{\mu}. \tag{9.34}$$

Stein (1956) provides a heuristic motivation that suggests $\bar{\mathbf{X}}$ is too large
in a certain sense explained below. To see this compare the expectation of the
squared norm of $\bar{\mathbf{X}}$ with the squared norm of $\boldsymbol{\mu}$:

$$E_{\boldsymbol{\mu}}\|\bar{\mathbf{X}}\|^2 = \|\boldsymbol{\mu}\|^2 + p\sigma^2/n > \|\boldsymbol{\mu}\|^2. \tag{9.35}$$

The larger the $p$, the bigger the deviation between the LHS and RHS. So
one would expect that, at least for sufficiently large $p$, one can get a better estimate
by shrinking each coordinate of $\bar{\mathbf{X}}$ suitably towards zero. We present below
two of the most well-known shrinkage estimates, namely, the James-Stein
and the positive part James-Stein estimate. Both have already appeared in
Section 9.2 as PEB estimates. It seems to us that the PEB approach provides
the most insight about Stein's estimates, even though the PEB interpretation
came much later.
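
Both the heuristic (9.35) and the dominance (9.34) are easy to check by simulation. The sketch below is illustrative only: the mean vector is hypothetical, the function name is ours, and the shrinkage factor $1 - (p - 2)(\sigma^2/n)/\|\bar{\mathbf{X}}\|^2$ is the standard James-Stein form, a known result that is not written out explicitly in the text above.

```python
import random

def simulate_risks(mu, c, n_rep=2000, seed=0):
    """Monte Carlo estimates of E||X||^2 and of the risks of X and of James-Stein.

    mu is the true mean vector and c stands for sigma^2 / n.
    """
    rng = random.Random(seed)
    p = len(mu)
    sd = c ** 0.5
    sq_norm = loss_x = loss_js = 0.0
    for _ in range(n_rep):
        x = [m + sd * rng.gauss(0, 1) for m in mu]
        nx = sum(v * v for v in x)
        shrink = 1.0 - (p - 2) * c / nx            # James-Stein factor (shrinks toward 0)
        sq_norm += nx
        loss_x += sum((v - m) ** 2 for v, m in zip(x, mu))
        loss_js += sum((shrink * v - m) ** 2 for v, m in zip(x, mu))
    return sq_norm / n_rep, loss_x / n_rep, loss_js / n_rep

# Hypothetical truth: p = 10 means all equal to 0.5, with sigma^2 / n = 1.
mu = [0.5] * 10
mean_sq_norm, risk_x, risk_js = simulate_risks(mu, c=1.0)
```

Here $\|\boldsymbol{\mu}\|^2 + p\sigma^2/n = 2.5 + 10 = 12.5$, so `mean_sq_norm` should be close to 12.5 and `risk_x` close to 10, while `risk_js` should be visibly smaller.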

Morris points out that there is no exchangeability or prior in Stein’s ap- 
proach but summing the individual losses produces a similar effect. Moreover, 
pooling the individual losses would be a natural thing to do only when the dif- 
ferent u;’s are related in some way. If they are totally unrelated, Stein’s result 
would be merely a curious fact with no practical significance, not a profound 
new data analytic tool. It is in the case of exchangeable high-dimensional 
problems that it provides substantial improvement. 

We present two approaches to proving that the James-Stein estimate is
superior to the classical estimate. One is based on Stein (1981) with details 
as in Ibragimov and Has’minskii (1981). The other is an interesting variation 
on this due to Schervish (1995). 


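Stein’s Identity. Let $X \sim N(\mu, \sigma^2)$ and $\phi(x)$ be a real valued function
differentiable on $R$ with $\int_a^x \phi'(u)\,du = \phi(x) - \phi(a)$. Then


$$\sigma^2 E(\phi'(X)) = E((X - \mu)\phi(X)).$$

Proof. Integrating by parts or changing the order of integration,


$$\begin{aligned}
E(\phi'(X)) &= \frac{1}{\sigma\sqrt{2\pi}} \int \phi'(x) \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} dx \\
&= \frac{1}{\sigma\sqrt{2\pi}} \int \phi(x)\, \frac{x-\mu}{\sigma^2} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} dx \\
&= \sigma^{-2} E(\phi(X)(X - \mu)). \qquad \Box
\end{aligned} \tag{9.36}$$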





For more details see the proof in Stein (1981). 
As a corollary we have the following result. 


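Corollary. Let $(X_1, X_2, \ldots, X_p)$ be a random vector $\sim N_p(\mu, \sigma^2 I)$. Let
$\phi = (\phi_1, \phi_2, \ldots, \phi_p) : R^p \to R^p$ be differentiable,
$E\left|\partial\phi_j/\partial x_j\right| < \infty$,
$\phi_j(x_1, \ldots, x_{j-1}, x, x_{j+1}, \ldots, x_p) - \phi_j(x_1, \ldots, x_{j-1}, a, x_{j+1}, \ldots, x_p) = \int_a^x \frac{\partial\phi_j}{\partial x_j}\,dx_j$
and assume that
$\phi_j(x) \exp\left\{-\frac{1}{2\sigma^2}(x_j - \mu_j)^2\right\} \to 0$ as $|x| \to \infty$. Then

$$E\left\{\sigma^2 \frac{\partial\phi_j}{\partial x_j}\right\} = E\{(X_j - \mu_j)\phi_j\}. \tag{9.37}$$

We now return to Example 9.1. The classical estimate for $\mu$ is
$X = (X_1, X_2, \ldots, X_p)$. Consider an alternative estimate of the form

$$\hat\mu = X + n^{-1}\sigma^2 g(X), \tag{9.38}$$

where $g(x) = (g_1, g_2, \ldots, g_p) : R^p \to R^p$ with $g$ satisfying the conditions in the
corollary.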
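Then, by the corollary (applied with variance $\sigma^2/n$),


$$\begin{aligned}
& E_\mu\|X - \mu\|^2 - E_\mu\|X + n^{-1}\sigma^2 g(X) - \mu\|^2 \\
&\quad = -2n^{-1}\sigma^2 E_\mu\{(X - \mu)' g(X)\} - n^{-2}\sigma^4 E_\mu\|g(X)\|^2 \\
&\quad = -2n^{-2}\sigma^4 E_\mu\left\{\sum_{j=1}^{p} \frac{\partial g_j}{\partial x_j}\right\} - n^{-2}\sigma^4 E_\mu\|g(X)\|^2.
\end{aligned} \tag{9.39}$$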


Now suppose $g(x) = \mathrm{grad}(\log \phi(x))$, where $\phi$ is a twice continuously
differentiable function from $R^p$ into $R$. Then


$$\sum_{j=1}^{p} \frac{\partial g_j}{\partial x_j} = -\|g\|^2 + \frac{1}{\phi}\Delta\phi \tag{9.40}$$


where $\Delta = \sum_{j=1}^{p} \partial^2/\partial x_j^2$. Hence


$$E_\mu\|X - \mu\|^2 - E_\mu\|\hat\mu - \mu\|^2 = n^{-2}\sigma^4 E_\mu\|g\|^2 - 2n^{-2}\sigma^4 E_\mu\left\{\frac{1}{\phi}\Delta\phi\right\} \tag{9.41}$$


which is positive if $\phi(x)$ is a positive non-constant superharmonic function,
i.e.,


$$\Delta\phi \le 0. \tag{9.42}$$


Thus $\hat\mu$ is superior to $X$ if (9.42) holds. It is known that such functions exist
if and only if $p \ge 3$.

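To show the superiority of the James-Stein positive part estimate for $p \ge 3$,
take


$$\phi(x) = \begin{cases}
\|x\|^{-(p-2)} & \text{if } \|x\| \ge \sqrt{p-2}, \\
(p-2)^{-(p-2)/2} \exp\left\{\tfrac{1}{2}\left((p-2) - \|x\|^2\right)\right\} & \text{otherwise.}
\end{cases} \tag{9.43}$$


Then, with $g = \mathrm{grad}(\log \phi)$, the estimate (9.38) is easily verified to be the
James-Stein positive part estimate (for $\sigma^2 = 1$, $n = 1$).
To show the superiority of the James-Stein estimate, take


$$\phi(x) = \|x\|^{-(p-2)}. \tag{9.44}$$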
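
As a worked check of the choice in (9.44), the gradient of $\log\phi$ can be
computed directly and substituted into (9.38). The display below is only a
sketch of that calculation; the shorthand $c = \sigma^2/n$ is introduced here for
convenience and is not notation from the text.

```latex
\begin{align*}
\log\phi(x) &= -\tfrac{p-2}{2}\,\log\|x\|^2, \\
g(x) = \mathrm{grad}(\log\phi(x)) &= -\frac{(p-2)\,x}{\|x\|^2}, \\
\hat\mu = X + c\,g(X) &= \Bigl(1 - \frac{(p-2)\,c}{\|X\|^2}\Bigr)X,
  \qquad c = \sigma^2/n, \\
\Delta\phi(x) &= 0 \quad \text{for } x \neq 0 .
\end{align*}
```

The third line is the usual James-Stein form, and the last line says that
$\phi$ is harmonic away from the origin, so the second term on the right of
(9.41) does not reduce the gain there.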


We observed earlier that shrinking towards zero is natural if one modeled
$\mu_j$’s as exchangeable with common mean equal to zero. We expect substantial
improvement if $\mu = 0$.

Calculation shows


$$E_\mu\|\hat\mu - \mu\|^2 = \frac{2}{p}\, E_\mu\|X - \mu\|^2 = 2 \tag{9.45}$$


if $\mu = 0$, $\sigma^2 = 1$, $n = 1$.

It appears that borrowing strength in the frequentist formulation is possi- 
ble because Stein’s loss adds up the losses of the component decision problems. 
Such addition would make sense only when the different problems are connected
in a natural way, in which case the exchangeability assumption and
the PEB or hierarchical models are also likely to hold. It is natural to ask how 
good are the James-Stein estimates in the frequentist sense. They are certainly 
minimax since they dominate minimax estimates. Are they admissible? Are 
they Bayes (not just PEB)? For the James-Stein positive part estimate the 
answer to both questions is no, see Berger (1985a, pp. 542, 543). On the other 
hand, Strawderman (1971) constructs a proper Bayes minimax estimate for
$p \ge 5$. Berger (1985a, pp. 364, 365) discusses the question of which among the
various minimax estimates to choose. Note that the PEB approach leads in 
a natural way to James-Stein positive part estimate, suggesting that it can’t 
be substantially improved even though it is not Bayes. See in this connection 
Robert (1994, p. 66). There is a huge literature on Stein estimates as well as 
questions of admissibility in multidimensional problems. Berger (1985a) and 
Robert (1994) provide excellent reviews of the literature. There are intrigu- 
ing connections between admissibility and recurrence of suitably constructed 
Markov processes, see Brown (1971), Srinivasan (1981), and Eaton (1992, 
1997, 2004).

When extreme $\mu_j$’s may occur, the Stein estimates do not offer much
improvement. Stein (1981) and Berger and Dey (1983) suggest how this problem
can be solved by suitably truncating the sample means. For Stein type results
for general ridge regression estimates see Strawderman (1978) where several
other references are given.

Of course, instead of zero we could shrink towards an arbitrary $\mu_0$. Then
a substantial improvement will occur near $\mu_0$. Exactly similar results hold for
the James-Stein-Lindley estimate and its positive part estimate if $p \ge 4$.

For the James-Stein estimate, Schervish (1995, pp. 163-165) uses Stein’s
identity as well as (9.40) but then shows directly (with $\sigma^2 = 1$, $n = 1$)


$$\|g\|^2 + 2\sum_{j=1}^{p} \frac{\partial g_j}{\partial x_j} = -\frac{(p-2)^2}{\|x\|^2} < 0.$$


Clearly for $\hat\mu$ = James-Stein estimate,


$$E_\mu\|\hat\mu - \mu\|^2 = p - E_\mu\left\{\frac{(p-2)^2}{\|X\|^2}\right\},$$


which shows how the risk can be evaluated by simulating a noncentral $\chi^2$
-distribution.
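
The risk formula above lends itself to a quick numerical check. The following
Python sketch (standard library only) assumes $\sigma^2 = 1$, $n = 1$ and estimates
the James-Stein risk in two ways: directly, by averaging the squared-error
loss, and through $p - (p-2)^2 E(1/\|X\|^2)$, where $\|X\|^2$ is noncentral $\chi^2$
with $p$ degrees of freedom. The function names and the number of replications
are illustrative choices, not part of the text.

```python
import random


def js_estimate(x):
    """James-Stein estimate shrinking toward zero (assumes sigma^2 = 1, n = 1)."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    shrink = 1.0 - (p - 2) / norm2
    return [shrink * v for v in x]


def simulate_risks(mu, reps=20000, seed=0):
    """Return two Monte Carlo estimates of the James-Stein risk at mu.

    The first averages the loss ||mu_hat - mu||^2 directly; the second uses
    p - (p - 2)^2 * E[1 / ||X||^2], with ||X||^2 a noncentral chi-square draw.
    """
    rng = random.Random(seed)
    p = len(mu)
    loss_sum = 0.0
    inv_norm2_sum = 0.0
    for _ in range(reps):
        x = [m + rng.gauss(0.0, 1.0) for m in mu]
        est = js_estimate(x)
        loss_sum += sum((e - m) ** 2 for e, m in zip(est, mu))
        inv_norm2_sum += 1.0 / sum(v * v for v in x)
    direct = loss_sum / reps
    via_formula = p - (p - 2) ** 2 * inv_norm2_sum / reps
    return direct, via_formula


if __name__ == "__main__":
    # At mu = 0 both numbers should be near 2, in line with (9.45);
    # the classical estimate X has risk p = 10 here.
    print(simulate_risks([0.0] * 10))
```

Running the same function with a mean vector far from zero shows the risk
approaching $p$, which is the sense in which the improvement is concentrated
near the point one shrinks towards.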


9.5 Comparison of High-dimensional and 
Low-dimensional Problems 


In the low-dimensional case, where n is large or moderate and p small, the 
prior is washed away by the data, the likelihood influences the posterior more 
than the prior. This is not so when p is much larger than n — the so-called 
high-dimensional case. The prior is important, so elicitation, if possible, is 
important. Checking the prior against data is possible and should be explored. 
We discuss this below. 

In the high-dimensional cases examined in Sections 9.2 and 9.3 some
aspects of the prior, namely $\pi(\mu_j|\eta)$, can be checked against the empirical
distribution. We have discussed this earlier mathematically, but one can approach
this from a more intuitive point of view. Because we have many $\mu_j$’s as a sample
from $\pi(\mu_j|\eta)$ and $X_j$’s provide approximate estimates of $\mu_j$’s, the empirical
distribution of the $X_j$’s should provide a check on the appropriateness of
$\pi(\mu_j|\eta)$.

Thus there is a curious dichotomy. In the low-dimensional case, the data
provide a lot of information about the parameters but not much information
about their distribution, i.e., the prior. The opposite is true in
high-dimensional problems. The data don’t tell us much about the parameters but
there is information about the prior.

This general fact suggests that the smoothed empirical distribution of
estimates could be used to generate a tentative prior if the likelihood is not
exponential and so conjugate priors cannot be used. Adding a location-scale
hyperparameter $\eta$ could provide a family of priors as a starting point of
objective high-dimensional Bayesian analysis.

Bernardo (1979) has shown that at least for Example 9.1 a sensible 
Bayesian analysis can be based on a reference prior with a suitable repa- 
rameterization. It does seem very likely that this example is not an exception 
but a general theory of the right reparameterization needs to be developed. 


9.6 High-dimensional Multiple Testing (PEB) 


Multiple tests have become very popular because of application in many areas 
including microarrays where one searches for genes that have been expressed. 
We provide a minimal amount of modeling that covers a variety of such appli- 
cations arising in bioinformatics, statistical genetics, biology, etc. Microarrays 
are discussed in Appendix D. Whereas PEB or HB high-dimensional estima- 
tion has been around for some time, PEB or HB high-dimensional multiple 
testing is of fairly recent origin, e.g., Efron et al. (2001a), Newton et al. (2003), 
etc. 

We have $p$ samples, each of size $n$, from $p$ normal populations. In the
simplest case we assume the populations are homoscedastic. Let $\sigma^2$ be the
common unknown variance, and the means $\mu_1, \ldots, \mu_p$.

For $\mu_j$, consider the hypotheses $H_{0j}: \mu_j = 0$, $H_{1j}: \mu_j \sim N(\eta_1, \eta_2)$,
$j = 1, \ldots, p$. The data are $X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$. In the gene expression
problem, $X_{ij}$, $i = 1, \ldots, n$ are $n$ i.i.d. observations on the expression of the
$j$th gene. The value of $|X_{ij}|$ may be taken as a measure of observed intensity
of expression. If one accepts $H_{0j}$, it amounts to saying the $j$th gene is not
expressed in this experiment. On the other hand, accepting $H_{1j}$ is to assert
that the $j$th gene has been expressed. Roughly speaking, a gene is said to be
expressed when the gene has some function in the cell or cells being studied,
which could be a malignant tumor. For more details, see the appendix. In
addition to $H_{0j}$ and $H_{1j}$, the model involves $\pi_0$ = probability that $H_{0j}$ is true
and $\pi_1 = 1 - \pi_0$ = probability that $H_{1j}$ is true. If


$$I_j = \begin{cases} 1 & \text{if } H_{1j} \text{ is true;} \\ 0 & \text{if } H_{0j} \text{ is true,} \end{cases}$$


then we assume $I_1, \ldots, I_p$ are i.i.d. $\sim B(1, \pi_1)$.

The interpretation of $\pi_1$ has a subjective and a frequentist aspect. It
represents our uncertainty about expression of each particular gene as well as
the approximate proportion of expression among $p$ genes.

If $\sigma^2, \pi_1, \eta_1, \eta_2$ are all known, $\bar X_j$ is sufficient for $\mu_j$ and a Bayes test is
available for each $j$. Calculate the posterior probability of $H_{1j}$:

$$\pi_{1j} = \frac{\pi_1 f_1(\bar X_j)}{\pi_1 f_1(\bar X_j) + \pi_0 f_0(\bar X_j)}$$


which is a function of $\bar X_j$ only. Here $f_0$ and $f_1$ are densities of $\bar X_j$ under $H_{0j}$
and $H_{1j}$.


If $\pi_{1j} > \frac{1}{2}$ accept $H_{1j}$ and if $\pi_{1j} < \frac{1}{2}$ accept $H_{0j}$.


This test is based only on the data for the $j$th gene.
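
To make the test concrete, here is a minimal Python sketch of the posterior
probability of $H_{1j}$. It assumes that $\eta_2$ denotes the variance of the
$N(\eta_1, \eta_2)$ prior, so that the marginal density of $\bar X_j$ is $N(0, \sigma^2/n)$
under $H_{0j}$ and $N(\eta_1, \eta_2 + \sigma^2/n)$ under $H_{1j}$. The function names are
hypothetical and chosen only for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior_prob_h1(xbar, n, sigma2, pi1, eta1, eta2):
    """Posterior probability of H_1j given the sample mean xbar of gene j.

    Assumes f0 = N(0, sigma2/n) and f1 = N(eta1, eta2 + sigma2/n),
    i.e. eta2 is treated as the prior variance under H_1j.
    """
    f0 = normal_pdf(xbar, 0.0, sigma2 / n)
    f1 = normal_pdf(xbar, eta1, eta2 + sigma2 / n)
    return pi1 * f1 / (pi1 * f1 + (1.0 - pi1) * f0)


def bayes_test(xbar, n, sigma2, pi1, eta1, eta2):
    """Return 1 (accept H_1j) if the posterior probability exceeds 1/2, else 0."""
    return 1 if posterior_prob_h1(xbar, n, sigma2, pi1, eta1, eta2) > 0.5 else 0
```

A sample mean near zero favors $H_{0j}$ and a sample mean far from zero favors
$H_{1j}$, with the switch point governed by $\pi_1$ and the two variances.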

In practice, we do not know $\pi_1, \eta_1, \eta_2$. In PEB testing, we have to estimate
all three. In HB testing, we have to put a prior on $(\pi_1, \eta_1, \eta_2)$. To us a natural
prior would be a uniform for $\pi_1$ on some range $(0, \delta)$, $\delta$ being an upper bound to
$\pi_1$, uniform prior for $\eta_1$ on $R$ and uniform or some other objective prior for
$\eta_2$.

In the PEB approach, we have to estimate $\pi_1, \eta_1, \eta_2$. If $\sigma^2$ is also unknown,
we have to put a prior on $\sigma^2$ also or estimate it from data. An estimate of $\sigma^2$
is $\sum_j \sum_i (X_{ij} - \bar X_j)^2 / \{p(n - 1)\}$.

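For fixed $\pi_1$, we can estimate $\eta_1$ and $\eta_2$ by the method of moments using
the equations,


$$\bar X = \frac{1}{p}\sum_{j=1}^{p} \bar X_j = \pi_1 \eta_1, \tag{9.46}$$

$$\frac{1}{p}\sum_{j=1}^{p} (\bar X_j - \bar X)^2 = \frac{\sigma^2}{n} + \pi_1 \eta_2 + \pi_1 (1 - \pi_1)\eta_1^2, \tag{9.47}$$


from which it follows that


$$\hat\eta_1 = \frac{1}{\pi_1}\bar X, \tag{9.48}$$

$$\hat\eta_2 = \frac{1}{\pi_1}\left\{\frac{1}{p}\sum_{j=1}^{p} (\bar X_j - \bar X)^2 - \frac{\sigma^2}{n} - \pi_1(1 - \pi_1)\hat\eta_1^2\right\}. \tag{9.49}$$


Alternatively, if it is felt that $\eta_1 = 0$, then the estimate for $\eta_2$ is given by


$$\hat\eta_2 = \frac{1}{\pi_1}\left\{\frac{1}{p}\sum_{j=1}^{p} \bar X_j^2 - \frac{\sigma^2}{n}\right\}. \tag{9.50}$$


Now we may maximize the joint likelihood of $\bar X_j$’s with respect to $\pi_1$.

Using these estimates, we can carry out the Bayes test for each $j$, provided
we know $\pi_1$ or put a prior on $\pi_1$. We do not know of good PEB estimates of
$\pi_1$.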
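
The moment equations (9.46) and (9.47) are easy to solve numerically for a
fixed $\pi_1$. The Python sketch below does this, together with the pooled
estimate of $\sigma^2$ mentioned earlier. The function names are hypothetical, and
truncating a negative variance estimate at zero is an assumption made here
for safety, not something stated in the text.

```python
def pooled_variance(data):
    """Pooled estimate of sigma^2; data[j] holds the n observations for gene j."""
    p = len(data)
    n = len(data[0])
    total = 0.0
    for obs in data:
        mean_j = sum(obs) / n
        total += sum((x - mean_j) ** 2 for x in obs)
    return total / (p * (n - 1))


def moment_estimates(xbars, n, sigma2, pi1):
    """Solve (9.46)-(9.47) for eta1 and eta2 with pi1 held fixed.

    xbars are the p gene-wise sample means. A negative solution for eta2
    is truncated at zero (an assumption of this sketch).
    """
    p = len(xbars)
    grand_mean = sum(xbars) / p
    spread = sum((x - grand_mean) ** 2 for x in xbars) / p
    eta1 = grand_mean / pi1
    eta2 = (spread - sigma2 / n - pi1 * (1.0 - pi1) * eta1 ** 2) / pi1
    return eta1, max(eta2, 0.0)
```

The resulting estimates could then be plugged into the Bayes test for each
gene, once a value or prior for $\pi_1$ has been chosen.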

Scott and Berger (2005) provide a very illuminating fully Bayesian analysis 
for microarrays. 


9.6.1 Nonparametric Empirical Bayes Multiple Testing 


Nonparametric empirical Bayes (NPEB) solutions were introduced by Robbins 
(1951, 1955, 1964). It is a Bayes solution based on a nonparametric estimate of 
the prior. Robbins applied these ideas in an ingenious way in several problems. 
It was regarded as a breakthrough, but the method never became popular 
because the nonparametric methods did not perform well even in moderately 
large samples and were somewhat unstable. 

Recently Efron et al. (2001a, b) have made a successful application to a 
microarray with p equal to several thousands. The data are massive enough 
for NPEB to be stable and perform well. 

After some reductions the testing problem takes the following form.
For $j = 1, 2, \ldots, p$, we have random variables $Z_j$. $Z_j \sim f_0(z)$ under $H_{0j}$ and
$Z_j \sim f_1(z)$ under $H_{1j}$ where $f_0$ is completely specified but $f_1(z) \ne f_0(z)$ is
completely unknown. This is what makes the problem nonparametric. Finally,
as in the case of parametric empirical Bayes, the indicator of $H_{1j}$ is $I_j = 1$
with probability $\pi_1$ and $= 0$ with probability $\pi_0 = 1 - \pi_1$. If $\pi_1$ and $f_1$ were
known we could use the Bayes test of $H_{0j}$ based on the posterior probability
of $H_{1j}$

$$P(H_{1j}|z_j) = \frac{\pi_1 f_1(z_j)}{\pi_1 f_1(z_j) + (1 - \pi_1) f_0(z_j)}.$$

Let $f(z) = \pi_1 f_1(z) + (1 - \pi_1) f_0(z)$. We know $f_0(z)$. Also we can estimate
$f(z)$ using any standard method (kernel, spline, nonparametric Bayes, vide
Ghosh and Ramamoorthi (2003)) from the empirical distribution of the $z_j$’s.
But since $\pi_1$ and $f_1$ are both unknown, there is an identifiability problem and
hence estimation of $\pi_1$, $f_1$ is difficult. The two papers, Efron et al. (2001a, b),
provide several methods for bounding $\pi_1$.

One bound follows from


$$\pi_0 \le \min_z\,[f(z)/f_0(z)], \qquad \pi_1 \ge 1 - \min_z\,[f(z)/f_0(z)].$$

So the posterior probability of $H_{1j}$ is


$$P(H_{1j}|z_j) = 1 - \frac{\pi_0 f_0(z_j)}{f(z_j)} \ge 1 - \left\{\min_z \frac{f(z)}{f_0(z)}\right\} \frac{f_0(z_j)}{f(z_j)}$$


which is estimated by $1 - \left\{\min_z \hat f(z)/f_0(z)\right\} f_0(z_j)/\hat f(z_j)$ where $\hat f$ is an
estimate of $f$ as mentioned above. The minimization will usually be made over
observed values of $z$.


Another bound is given by


$$\pi_0 \le \frac{\int_A f(z)\,dz}{\int_A f_0(z)\,dz}.$$

Now minimize the RHS over different choices of $A$. Intuition suggests a good
choice would be an interval centered at the mode of $f_0(z)$, which will usually
be at zero. A fully Bayesian nonparametric approach is yet to be worked out.
Other related papers are Efron (2003, 2004). For an interesting discussion of
microarrays and the application of nonparametric empirical Bayes methodology,
see Young and Smith (2005).
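
The two bounds of this subsection can be tried out with a few lines of Python.
The sketch below is only illustrative and rests on several assumptions that
are not in the text: $f_0$ is taken to be standard normal, $f$ is estimated by
a hand-rolled Gaussian kernel estimate with a user-chosen bandwidth, the set
$A$ is a symmetric interval $[-a, a]$, and both bounds are capped at 1. All
function names are hypothetical.

```python
import math


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kde(zs, bandwidth):
    """Return a Gaussian kernel density estimate f_hat built from the z values."""
    n = len(zs)

    def f_hat(z):
        total = sum(std_normal_pdf((z - zi) / bandwidth) for zi in zs)
        return total / (n * bandwidth)

    return f_hat


def pi0_bound_ratio(zs, f_hat, f0=std_normal_pdf):
    """Estimated upper bound on pi_0: min over observed z of f_hat(z) / f0(z)."""
    return min(1.0, min(f_hat(z) / f0(z) for z in zs))


def posterior_h1_lower_bound(z, pi0_upper, f_hat, f0=std_normal_pdf):
    """Estimated lower bound on P(H_1j | z): 1 - pi0_upper * f0(z) / f_hat(z)."""
    return 1.0 - pi0_upper * f0(z) / f_hat(z)


def pi0_bound_interval(zs, a, f0_cdf=std_normal_cdf):
    """Estimated upper bound on pi_0 using A = [-a, a].

    The numerator (integral of f over A) is estimated by the empirical
    fraction of z values in A; the denominator is the f0-mass of A.
    """
    frac = sum(1 for z in zs if -a <= z <= a) / len(zs)
    return min(1.0, frac / (f0_cdf(a) - f0_cdf(-a)))
```

Since $\hat f$ is noisy, the estimated bounds need not be valid bounds for the
true $\pi_0$; with only a few hundred $z$ values the minimum of the ratio can be
driven by a sparse region of the sample.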


9.6.2 False Discovery Rate (FDR) 


The false discovery rate (FDR) was introduced by Benjamini and Hochberg
(1995). Controlling it has become an important frequentist concept and
method in multiple testing, especially in high-dimensional problems. We
provide a brief review, because it has interesting similarities with NPEB, as noted,
e.g., in Efron et al. (2001a, b). We consider the multiple testing scenario
introduced earlier in this section. Consider a fixed test. The (random) FDR for
the test is defined as $\frac{U(z)}{V(z)} I_{\{V(z)>0\}}$, where $U$ = total number of false
discoveries, i.e., number of true $H_{0j}$’s that are rejected by the test for a $z$, and
$V$ = total number of discoveries, i.e., number of $H_{0j}$’s that are rejected by a test.
The (expected) FDR is


$$\mathrm{FDR} = E_\mu\left(\frac{U}{V} I_{\{V>0\}}\right).$$


To fix ideas suppose all $H_{0j}$’s are true, i.e., all $\mu_j$’s are zero, then $U = V$ and
so


$$\frac{U}{V} I_{\{V>0\}} = I_{\{V>0\}}$$
and


$$\mathrm{FDR} = P_{\mu=0}(\text{at least one } H_{0j} \text{ is rejected})
= \text{Type 1 error probability under the full null.}$$


This is usually called family wise error rate (FWER). The Benjamini-Hochberg 
(BH) algorithm (see Benjamini and Hochberg (1995)) for controlling FDR is 
to define 


$$j_0 = \max\left\{j : P_{(j)} \le \frac{j}{p}\alpha\right\}$$
where $P_j$ = the P-value corresponding with the test for $j$th null and $P_{(j)}$ =
$j$th order statistic among the P-values with $P_{(1)}$ = the smallest, etc.


The algorithm requires rejecting all $H_{0j}$ for which $P_j \le P_{(j_0)}$. Benjamini
and Hochberg (1995) showed this ensures


$$E_\mu\left(\frac{U}{V} I_{\{V>0\}}\right) \le \frac{p_0}{p}\alpha \le \alpha \quad \forall \mu$$


where $p_0$ is the number of true $H_{0j}$’s. It is a remarkable result because it is
valid for all $\mu$. This exact result has been generalized by Sarkar (2003).

Benjamini and Liu (1999) have provided another algorithm. See also
Benjamini and Yekutieli (2001). Genovese and Wasserman (2001) provide a test
based on an asymptotic evaluation of $j_0$ and a less conservative rejection rule.
An asymptotic evaluation is also available in Genovese and Wasserman (2002).
See also Storey (2002) and Donoho and Jin (2004). Scott and Berger (2005)
discuss FDR from a Bayesian point of view.

Controlling FDR leads to better performance under alternatives than con- 
trolling FWER. Many successful practical applications of FDR control are 
known. On the other hand, from a decision theoretic point of view it seems 
more reasonable to control the sum of false discoveries and false negatives 
rather than FDR and proportion of false negatives. 
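
The Benjamini-Hochberg step-up rule described in this subsection is short
enough to write out. The following Python sketch takes a list of P-values and
a level $\alpha$ and returns the indices of the rejected nulls. The function name
is an illustrative choice; the rule itself is the one defining $j_0$ above.

```python
def benjamini_hochberg(pvalues, alpha):
    """Return the sorted indices of the nulls rejected by the BH step-up rule.

    j0 is the largest rank j with P_(j) <= (j / p) * alpha; every null whose
    P-value is at most P_(j0) is rejected. If no rank qualifies, nothing is
    rejected.
    """
    p = len(pvalues)
    order = sorted(range(p), key=lambda j: pvalues[j])
    j0 = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / p:
            j0 = rank
    if j0 == 0:
        return []
    cutoff = pvalues[order[j0 - 1]]
    return [j for j in range(p) if pvalues[j] <= cutoff]
```

Because the rule looks for the largest qualifying rank, a P-value can be
rejected even when it exceeds its own threshold, as long as some larger
ordered P-value falls below its threshold.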


9.7 Testing of a High-dimensional Null as a Model 
Selection Problem¹


Selection from among nested models is one way of handling testing problems
as we have seen in Chapter 6. Parsimony is taken care of to some extent
by the prior on the additional parameters of the more complex model. As
in estimation or multiple testing, consider samples of size $r$ from $p$ normal
populations $N(\mu_i, \sigma^2)$. For simplicity $\sigma^2$ is assumed known. Usually $\sigma^2$ will
be unknown. Because $S^2 = \sum_i \sum_j (X_{ij} - \bar X_i)^2 / p(r-1)$ is an unbiased estimate
of $\sigma^2$ with lots of degrees of freedom, it does not matter much whether we put
one of the usual objective priors for $\sigma^2$ or pretend that $\sigma^2$ is known to be $S^2$.

We wish to test $H_0: \mu_i = 0\ \forall i$ versus $H_1$: at least one $\mu_i \ne 0$. This
is sometimes called Stone’s problem, Berger et al. (2003), Stone (1979). We
may treat this as a model selection problem with $M_0 \equiv H_0: \mu_i = 0\ \forall i$ and
$M_1 = H_0 \cup H_1$, i.e., $M_1: \mu \in R^p$. In this formulation, $M_0 \subset M_1$ whereas $H_0$
and $H_1$ are disjoint. On grounds of parsimony, $H_0$ is favored if both $M_0$ and
$M_1$ are equally plausible.

To test a null or select a model, we have to define a prior $\pi(\mu)$ under $M_1$
and calculate the Bayes factor


$$BF_{01} = \frac{\prod_i f_0(\bar X_i)}{\int_{R^p} \prod_i f_1(\bar X_i|\mu_i)\,\pi(\mu)\,d\mu}.$$


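There is no well developed theory of objective priors, especially for
testing problems. However as in estimation it appears natural to treat $\mu_i$’s as
exchangeable rather than independent. A popular prior in this context is the
Zellner and Siow (1980) multivariate Cauchy prior


$$\pi_c(\mu) = \frac{\Gamma\left(\frac{p+1}{2}\right)}{\pi^{(p+1)/2}\,\sigma^p}\left(1 + \frac{\mu'\mu}{\sigma^2}\right)^{-(p+1)/2}
= \int_0^\infty \left(\frac{t}{2\pi\sigma^2}\right)^{p/2} e^{-t\mu'\mu/(2\sigma^2)}\, \frac{1}{\sqrt{2\pi}}\, e^{-t/2}\, t^{-1/2}\,dt.$$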


¹ Section 9.7 may be omitted at first reading.

Another plausible prior is the smooth Cauchy prior given by


$$\pi_{sc}(\mu) = \frac{\Gamma\left(\frac{p+1}{2}\right)}{\Gamma\left(\frac{p}{2}+1\right)\pi^{1/2}\,(2\pi\sigma^2)^{p/2}}\,
M\left(\frac{p+1}{2}, \frac{p}{2}+1, -\frac{\mu'\mu}{2\sigma^2}\right)
= \int_0^1 \left(\frac{t}{2\pi\sigma^2}\right)^{p/2} e^{-t\mu'\mu/(2\sigma^2)}\, \frac{1}{\pi\sqrt{t(1-t)}}\,dt, \tag{9.51}$$


where $M(a, b, z)$ is the hypergeometric ${}_1F_1$ function of Abramowitz and
Stegun (1970).

It is tempting to use the difference (between the two models) of BIC as
an approximation to the logarithm of Bayes factor (BF) even though it was
developed by Schwarz for low-dimensional problems. Stone was the first to
point out that the use of BIC is problematic in high-dimensional problems.
Berger et al. (2003) have developed a generalization of BIC called GBIC, which
provides a good approximation to the integrated likelihood for priors like the
above Cauchy priors which are obtained by integrating the scale parameter
for $N(\mu_i, \sigma^2)$. In Stone’s problem one has the normal linear model setup


$$y_{ij} = \mu_i + \epsilon_{ij}, \quad i = 1, \ldots, p,\ j = 1, \ldots, r. \tag{9.52}$$


It is assumed that as $n \to \infty$, $p \to \infty$ and $r$ is fixed. Under these assumptions,
Berger et al. (2003) provide a Laplace approximation and a GBIC. The GBIC
also approximates the BIC for low-dimensional problems. The formula for
$\Delta$GBIC (the difference of GBIC for the comparison of $M_1$ and $M_0$) is given


by 
ial l 
AGBIC = (SX'X = E log(rep) = =) = =, (9.53) 
where cp = +3 x . Table 9.1, taken from Berger et al. (2003) provides 


some idea of the accuracy of BIC, GBIC and Laplace approximation. One has
p = 50 and r = 2 for these calculations and the multivariate Cauchy prior 
was used. 

Substantial new results appear in Liang et al. (2005). They propose a
mixture of Zellner’s (Zellner (1986)) popular g-prior. In Zellner’s form, the
prior looks like $\mu|M_1 \sim N(0, \frac{\sigma^2}{g}(Z'Z)^{-1})$ where $Z$ is the design matrix (in
our problem only composed of 0’s and 1’s). This $g$ is usually elicited through
an empirical Bayes method. The above authors consider a family of mixtures
of g-priors (of which the Zellner-Siow Cauchy prior is a special case) and
use those for model selection. They propose Laplace approximations to the
marginal likelihood under these general priors and show that the models thus
selected are generally correct asymptotically if the complex model is true.
Under the null model, this type of consistency still holds under the
Zellner-Siow prior.

Table 9.1. Comparison of the Performance of GBIC and Laplace Approximation
with BIC


Further generalizations to non-normal problems appear in Berger (2005)
and Chakrabarti and Ghosh (2005a). Both papers provide generalizations of
BIC when the observations come from an exponential family of distributions in
high-dimensional problems. In Table 9.2, using simulation results reported in
Chakrabarti and Ghosh (2005a), the performance of GBIC and the Laplace
approximation ($\log \hat m_2$) with BIC are compared in approximating the
integrated likelihood under the more complex model (denoted by $m_2$) when the
more complex model is actually true and observations come from Bernoulli,
exponential, and Poisson distributions. In this case one has $p$ groups of
observations, each group having a (potentially) different parameter value and each
group has $r$ observations. Under the simpler model, these different groups are
assumed to have the same (specified) parameter value, while for the more
complex model the parameter vector is assumed to belong to $R^p$. See the
paper for details on the priors used.

In principle, the same methods apply to any two nested models

$$M_0: \mu_i = 0,\ 1 \le i \le p_1,\ p_1 < p \quad \text{versus} \quad M_1: \mu \in R^p.$$


Table 9.2. Approximation to Integrated Likelihood in the Exponential Family


Distribution | p | | $\log m_2$ | $\log \hat m_2$ | BIC | GBIC


9.8 High-dimensional Estimation and Prediction Based
on Model Selection or Model Averaging²


Given a set of data from an experiment or observational study done on a 
given population, a statistician is asked the following three questions quite 
frequently. First, which among a given set of possible statistical models seems 
to be the correct model describing the underlying mechanism producing the 
data? Second, what will be the predicted value of a future observation, if 
the experimental conditions are kept at predetermined levels? Third, what is 
the estimate of a single parameter or a vector (may be infinite dimensional) 
of parameters? We will focus in this section on some Bayesian approaches 
to answer the last two types of questions. But before going into the details, 
we will explain briefly in the next paragraph how one would pose the above 
three questions from a decision theoretic point of view and what is the basic 
difference in the Bayesian approaches in tackling such questions. 

Bayesian approaches to such questions are basically dictated by the goal of 
obtaining decision theoretic optimality, and hence the solutions are also heav- 
ily dependent upon the type of loss functions being used. The loss function, on 
the other hand, is mostly determined by the goal of the statistician or practi- 
tioner. The goal of the statistician in the first problem above is to select the 
correct model (which is assumed to be one in the list of models considered). 
The loss function often used in this problem is the 0-1 loss function. In the 
Bayesian approach to model selection, the statistician would put prior proba- 
bilities on the set of candidate models and a simple argument shows that for 
this loss, the optimum Bayesian model would be the posterior mode, i.e., the 
model that has the maximum posterior probability. As explained in the ear- 
lier section, BIC and GBIC can be used to select a model using the Bayesian 
paradigm with 0-1 loss if the sample size is large, in appropriate situations, as 
they approximate the integrated likelihood and hence can be used to find the 
model with highest posterior probability. On the other hand, if one is inter- 
ested in answering the second or third question above (i.e., if one is interested 
in prediction or estimation of a parameter), the problem can be approached in 
two different ways. First, one might be interested in finding a particular model 
that does the best job of prediction (in some appropriate sense). Secondly, one 
might only want a predicted value, not a particular model for repeated fu- 
ture use in prediction. In either case, the most popular loss function is the 
squared prediction error loss, i.e., the square of the difference between the 
predicted/estimated value and the value being predicted/estimated. The best 
predictor/estimator turns out to be the Bayesian model averaging estimate (to 
be explained later) and the best predictive model is the one which minimizes 
the expected posterior predictive loss. 

² Section 9.8 may be omitted at first reading.

We now consider the problem of optimal prediction from a Bayesian
approach. We use the ideas, notations, and results of Barbieri and Berger (2004)
for this part. Consider the canonical model

$$y = X\beta + \epsilon, \tag{9.54}$$


where $y$ is an $n \times 1$ vector of observations, $X$ is the $n \times k$ full rank design
matrix, $\beta$ is the unknown $k \times 1$ vector of regression coefficients and $\epsilon$ is the
$n \times 1$ vector of random errors, which are i.i.d. $N(0, \sigma^2)$, $\sigma^2$ being known or
unknown. Our goal is to predict a future observation $y^*$, given by


$$y^* = x^*\beta + \epsilon, \tag{9.55}$$
where $x^* = (x_1^*, \ldots, x_k^*)$ is the value of the covariate vector for which the
prediction is to be made. We consider the loss in predicting $y^*$ by $\hat y^*$ as

$$L(\hat y^*, y^*) = (\hat y^* - y^*)^2; \tag{9.56}$$


i.e., the squared error prediction loss. Assume that we have submodels

$$M_l: y = X_l\beta_l + \epsilon, \tag{9.57}$$


where $l = (l_1, \ldots, l_k)$ with $l_i = 1$ or 0 according as the $i$th covariate is in the
model $M_l$ or not, $X_l$ is a matrix containing columns of $X$ corresponding with
the nonzero coordinates of $l$ and $\beta_l$ is the corresponding vector of regression
coefficients. Let $k_l$ denote the number of covariates included in the model;
then $X_l$ is of dimension $(n \times k_l)$ and $\beta_l$ is a $(k_l \times 1)$ vector.

We put prior probabilities $P(M_l)$ to each model $M_l$ included in the model
space such that $\sum_l P(M_l) = 1$, and given model $M_l$, a prior $\pi_l(\beta_l, \sigma)$ is
assumed on the parameters $(\beta_l, \sigma)$ included in model $M_l$. Using standard
posterior calculations, one obtains the quantities (a) $p_l = P(M_l|y)$, the posterior
probability of model $M_l$ and (b) $\pi_l(\beta_l, \sigma|y)$, the posterior distribution of the
unknown parameters in $M_l$. With this setup in mind, we shall now discuss
two optimal prediction strategies, as described below.

First note that the best predictor of $y^*$ for a given value of $x^*$ comes
out as $\bar y^* = E(y^*|y)$, where the expectation is taken with respect to the
posterior/predictive distribution of $y^*$ given $y$. This follows by noting that


$$E[(y^* - \hat y^*)^2] = E^y E[(y^* - \hat y^*)^2 | y], \tag{9.58}$$


where the expectation inside is taken with respect to the posterior distribution
of $y^*$ given $y$. But note that


$$\bar y^* = E(y^*|y) = \sum_l p_l E(y^*|y, M_l) = x^* \sum_l p_l H_l \tilde\beta_l, \tag{9.59}$$


where $H_l$ is a $(k \times k_l)$ matrix such that $x^* H_l$ is the subvector of $x^*$
corresponding to the nonzero coordinates of $l$ and $\tilde\beta_l$ is the posterior mean of $\beta_l$ with
respect to $\pi_l(\beta_l, \sigma|y)$. Noting that if we knew that $M_l$ were the true model,
then the optimal predictor of $y^*$ for $x$ fixed at $x^*$ would be given by

$$\hat y_l^* = x^* H_l \tilde\beta_l, \tag{9.60}$$

we have

$$\bar y^* = E(y^*|y) = x^*\bar\beta = x^* \sum_l p_l H_l \tilde\beta_l = \sum_l p_l \hat y_l^*. \tag{9.61}$$
i I 


$\bar y^*$ is called the Bayesian model averaging estimate, in that it is a weighted
average of the optimal Bayesian predictors under each individual model, the
weights being the posterior probabilities of each model. Many authors have
argued for the use of the model averaging estimate as an appropriate predictive
estimate. They justify this by saying that using model selection to choose
the best model and then making inference based on the assumption that the
selected model is true does not take into account the fact that there is
uncertainty about the model itself. As a result, one might underestimate the
uncertainty about the quantity of interest. See, for example, Madigan and
Raftery (1994), Raftery, Madigan, and Hoeting (1997), Hoeting, Madigan,
Raftery, and Volinsky (1999), and Clyde (1999), to name just a few, for
detailed discussion on this point of view. However if the number of models in the
model space is very large (e.g., in case all subsets of parameters are allowed in 
the model space, as will happen in high or even moderately high dimensions), 
the task of computing the Bayesian model averaging estimate exactly might 
be virtually impossible. Moreover, it is not prudent to keep in the model av- 
erage those models that have small posterior probability indicating relative 
incompatibility with observed data. There are some proposals to get around 
this difficulty, as discussed in the literature cited above. Two of them are 
based on the ‘Occam’s window’ method of Madigan and Raftery (1994) and 
the Markov chain Monte Carlo approach of Madigan and York (1995). 
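
When the model space is small enough to enumerate, the model averaging
predictor in (9.61) is a plain weighted sum. The Python sketch below assumes
that the posterior model probabilities $p_l$ and the posterior means $\tilde\beta_l$
have already been computed by some other means; the input format and the
function name are hypothetical, chosen only to mirror the notation.

```python
def bma_prediction(x_star, models):
    """Bayesian model averaging predictor: sum over l of p_l * (x* H_l beta_l).

    models is a list of triples (l, p_l, beta_l), where l is a tuple of 0/1
    inclusion indicators, p_l is the posterior probability of M_l, and
    beta_l lists the posterior means of the included coefficients in order.
    """
    total = 0.0
    for include, post_prob, beta in models:
        # x* H_l: keep only the covariates with l_i = 1.
        x_sub = [x for x, li in zip(x_star, include) if li == 1]
        y_l = sum(x * b for x, b in zip(x_sub, beta))
        total += post_prob * y_l
    return total
```

With $k$ covariates and all subsets allowed there are $2^k$ terms in this sum,
which is the computational difficulty the text refers to.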

In the first approach, the averaging is done over a small set of
appropriately selected models, which are parsimonious and supported by data. In
the second approach, one constructs a Markov chain with state space the same
as the model space and equilibrium distribution $\{P(M_l|y)\}$ where $M_l$ varies
over the model space. Upon simulation from this chain, the Bayesian model
averaging estimator is approximated by taking the average value of the posterior
expectations under each model visited in the chain. But it must be commented
that Bayesian model averaging (BMA) has its limitations in high-dimensional
problems. Each approach addresses both issues but it is unclear how well.

Although BMA is the optimal predictive estimation procedure, often a 
single model is desired for prediction. For example, choice of a single model 
will require observing only the covariates included in the model. Also, as noted 
earlier, in high dimensions, BMA has its problems. We will assume now that 
the future predictions will be made for covariates x* such that 


Q = E(x*'x*) 


exists and is positive definite. A frequent choice of $Q$ is $Q = X'X$, i.e., the
future covariates will be like the ones observed in the past. In general, the best
single model will depend on $x^*$, but we present here some general
characterizations which give the optimal predictive model without this dependence. In
general, the optimal predictive model is not the one with the highest
posterior probability. However, there are interesting exceptions. If there are only
two models, it is easy to show the posterior mode with shrinkage estimate is
optimal for prediction (Berger (1997) and Mukhopadhyay (2000)). This also
holds sometimes in the context of variable selection for linear models with
orthogonal design matrix, as in Clyde and Parmigiani (1996). As Berger (1997)
notes, it is easy to see that if one is considering only two models, say $M_1$
and $M_2$ with prior probabilities $\frac{1}{2}$ each and proper priors are assigned to the
unknown parameters under each model, the best predictive model turns out
to be $M_1$ or $M_2$ according as the Bayes factor of $M_1$ to $M_2$ is greater than
one or not, and hence the best predictive model is the one with the highest
posterior probability. The characterizations we will describe here are in terms
of what is called the ‘median probability model.’ If it exists, the median
probability model $M_{l^*}$ is defined to be the model consisting of those variables only
whose posterior inclusion probabilities are at least $\frac{1}{2}$. The posterior inclusion
probability for variable $i$ is


$$p_i = \sum_{l:\, l_i = 1} P(M_l|y). \tag{9.62}$$


So, $l^*$ is defined coordinatewise as $l_i^* = 1$ if $p_i \ge \frac{1}{2}$ and $l_i^* = 0$ otherwise.
It is possible that the median probability model does not exist, in that the 
variables included according to the definition of 1* do not correspond with 
any model under consideration. But in the variable selection problem, if we 
are allowed to include or exclude any variable in the possible models, i.e., 
all possible values of l are allowed, then the median probability model will 
obviously exist. Another important class of models is a class of models with 
‘graphical model structure’ for which the median probability model will always 
exist (this fact follows directly from the definition below). 
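
A small numerical illustration may help separate the median probability
model from the highest posterior probability model. The Python sketch below
computes the posterior inclusion probabilities of (9.62) from a table of
posterior model probabilities and then forms $l^*$. The dictionary format and
function names are hypothetical, and the probabilities in the usage note are
made-up numbers for illustration.

```python
def inclusion_probabilities(models):
    """Posterior inclusion probabilities p_i as in (9.62).

    models maps each inclusion vector l (a tuple of 0/1) to P(M_l | y).
    """
    k = len(next(iter(models)))
    return [
        sum(prob for l, prob in models.items() if l[i] == 1)
        for i in range(k)
    ]


def median_probability_model(models):
    """Return l* if it is one of the models under consideration, else None."""
    incl = inclusion_probabilities(models)
    l_star = tuple(1 if p_i >= 0.5 else 0 for p_i in incl)
    return l_star if l_star in models else None
```

For example, with posterior probabilities 0.4, 0.3 and 0.3 on the models
(1, 0), (0, 1) and (1, 1), and 0 on (0, 0), the inclusion probabilities are
0.7 and 0.6, so the median probability model is (1, 1) even though (1, 0) has
the highest posterior probability.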


Definition 9.4. Suppose that for each variable index $i$, there is a corresponding
index set $I(i)$ of other variables. A subclass of linear models is said to have
‘graphical model structure’ if it consists of all models satisfying the condition
‘for each $i$, if variable $x_i$ is in the model, then variables $x_j$ with $j \in I(i)$ are
in the model.’


The class of models with ‘graphical model structure’ includes the class of
models with all possible subsets of variables and sequences of nested models,
$M_{l(j)}$, $j = 0, 1, \ldots, k$, where $l(j) = (1, \ldots, 1, 0, \ldots, 0)$ with $j$ ones and $k - j$
zeros. For the all subsets scenario, $I(i)$ is the null set while in the nested
case $I(i) = \{j : 1 \le j < i\}$ for $i \ge 2$ and $I(i)$ is the null set for $i = 0$ or
1. The latter are natural in many examples including polynomial regression
models, where $j$ refers to the degree of polynomial used. Another example
of nested models is provided by nonparametric regression (vide Chapter 10,
Sections 10.2, 10.3). The unknown function is approximated by partial sums
of its Fourier expansion, with all coefficients after stage $j$ assumed to be
zero. Note that in this situation, the median probability model has a simple
description; one calculates the cumulative sum of posterior model probabilities
beginning from the smallest model, and the median probability model is the
first model for which this sum equals or exceeds $\frac{1}{2}$. Mathematically, the median
probability model is $M_{l(j^*)}$, where

$$\sum_{j=0}^{j^*-1} P(M_{l(j)} \mid y) < \frac{1}{2} \quad \text{and} \quad \sum_{j=0}^{j^*} P(M_{l(j)} \mid y) \ge \frac{1}{2}. \qquad (9.63)$$
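The cumulative-sum description of the median probability model for nested models can be sketched as follows. This is an illustration only: the function name and the posterior probabilities are hypothetical, and the list `probs` is assumed to hold $P(M_{l(j)} \mid y)$ for $j = 0, 1, \ldots, k$ in order of increasing model size.

```python
def nested_median_index(probs):
    """Return the first j* at which the cumulative posterior probability reaches 1/2."""
    total = 0.0
    for j, prob in enumerate(probs):
        total += prob
        if total >= 0.5:
            return j
    raise ValueError("model probabilities must sum to at least 1/2")


# Hypothetical posterior probabilities of the nested models M_l(0), ..., M_l(3).
probs = [0.05, 0.30, 0.25, 0.40]
j_star = nested_median_index(probs)

# Cross-check through inclusion probabilities: variable i belongs to M_l(j) for j >= i,
# so p_i is the total probability of models of size at least i.
inclusion = [sum(probs[i:]) for i in range(1, len(probs))]
n_included = sum(1 for p_i in inclusion if p_i >= 0.5)

print(j_star, inclusion, n_included)
```

Here the cumulative sums are 0.05, 0.35, 0.60, so the rule stops at the model with two variables, whereas the highest posterior probability model is the largest one. In the absence of ties at exactly $\frac{1}{2}$, the number of variables with inclusion probability at least $\frac{1}{2}$ agrees with $j^*$.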


We present some results on the optimality of the posterior median model in 
prediction. The best predictive model is found as follows. Once a model is 
selected, the best Bayesian predictor assuming that model is true is obtained. 
In the next stage, one finds the model such that the expected prediction loss 
(this expectation does not assume any particular model is true, but is an over- 
all expectation) using this Bayesian predictor is minimized. The minimizer is 
the best predictive model. There are some situations where the median probability
model and the highest posterior probability model are the same. Obviously, if
there is one model with posterior probability greater than $\frac{1}{2}$, this will be trivially
true. Barbieri and Berger (2004) observe that when the highest posterior
probability model has substantially larger probability than the other models, 
it will typically also be the median probability model. We describe another 
such situation later in the corollary to Theorem 9.8. 


We state and prove two simple lemmas. 


Lemma 9.5. (Barbieri and Berger, 2004) Assume $Q$ exists and is positive
definite. The optimal model for predicting $y^*$ under squared error loss is
the unique model minimizing

$$R(M_l) = (H_l\tilde{\beta}_l - \bar{\beta})' Q (H_l\tilde{\beta}_l - \bar{\beta}), \qquad (9.64)$$

where $\bar{\beta}$ is defined in (9.61).


Proof. As noted earlier, $\hat{y}^*_l$ is the optimal Bayesian predictor assuming $M_l$ is
the true model. The optimal predictive model is found by minimizing with
respect to $l$, where $l$ belongs to the space of models under consideration, the
quantity $E(y^* - \hat{y}^*_l)^2$. Minimizing this is equivalent to minimizing for each $y$
the quantity $E[(y^* - \hat{y}^*_l)^2 \mid y]$. It is easy to see that for a fixed $x^*$,

$$E[(y^* - \hat{y}^*_l)^2 \mid y] = C + (\bar{y}^* - \hat{y}^*_l)^2, \qquad (9.65)$$

where the symbols have been defined earlier and $C$ is a quantity independent
of $l$. The expectation above is taken with respect to the predictive distribution
of $y^*$ given $y$ and $x^*$. So the optimal predictive model will be found by finding
the minimizer of the expression obtained by taking a further expectation over
$x^*$ on the second quantity on the right hand side of (9.65). By plugging in the
values of $\hat{y}^*_l$ and $\bar{y}^*$, we immediately get

$$(\bar{y}^* - \hat{y}^*_l)^2 = (H_l\tilde{\beta}_l - \bar{\beta})' x^{*\prime} x^* (H_l\tilde{\beta}_l - \bar{\beta}). \qquad (9.66)$$

The lemma follows. The uniqueness follows from the fact that $Q$ is positive
definite. □


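Lemma 9.6. (Barbieri and Berger, 2004) If $Q$ is diagonal with diagonal elements
$q_i > 0$, and the posterior means $\tilde{\beta}_l$ satisfy $\tilde{\beta}_l = H_l'\tilde{\beta}$ (where $\tilde{\beta}$ is the
posterior mean under the full model as in (9.54)) then

$$R(M_l) = \sum_{i=1}^{k} \tilde{\beta}_i^2\, q_i\, (l_i - p_i)^2. \qquad (9.67)$$

Proof. From the fact $\tilde{\beta}_l = H_l'\tilde{\beta}$, it follows that

$$\bar{\beta} = \sum_l P(M_l \mid y)\, H_l \tilde{\beta}_l = \sum_l P(M_l \mid y)\, H_l H_l' \tilde{\beta} = D(p)\tilde{\beta}, \qquad (9.68)$$

where $D(p)$ is the diagonal matrix with diagonal elements $p_i$, by noting that
$H_l(i, j) = 1$ if $l_i = 1$ and $j = \sum_{r=1}^{i} l_r$, and $H_l(i, j) = 0$ otherwise. Similarly,

$$R(M_l) = (H_l H_l'\tilde{\beta} - D(p)\tilde{\beta})' Q (H_l H_l'\tilde{\beta} - D(p)\tilde{\beta})
= \tilde{\beta}' (D(l) - D(p))\, Q\, (D(l) - D(p)) \tilde{\beta}, \qquad (9.69)$$

from where the result follows. □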


Remark 9.7. The condition $\tilde{\beta}_l = H_l'\tilde{\beta}$ simply means that the posterior mean
of $\beta_l$ is found by taking the relevant coordinates of the posterior mean in
the full model as in (9.54). As Barbieri and Berger (2004) comment, this
will happen in two important cases. Assume $X'X$ is diagonal. In the first
case, if one uses the reference prior $\pi(\beta_l, \sigma) = 1/\sigma$ or a constant prior if $\sigma$
is known, the LSE coincides with the posterior mean and the diagonality
of $X'X$ implies that the above condition will hold. Secondly, suppose in the
full model $\pi(\beta, \sigma) = N_k(\mu, \sigma^2\Lambda)$ where $\Lambda$ is a known diagonal matrix, and for
the submodels one takes the natural corresponding prior $N_{k_l}(H_l'\mu, \sigma^2 H_l'\Lambda H_l)$. Then
it is easy to see that for any prior on $\sigma^2$ or if $\sigma^2$ is known, the above will hold.


We now state the first theorem. 


Theorem 9.8. (Barbieri and Berger, 2004) If $Q$ is diagonal with $q_i > 0$ and
$\tilde{\beta}_l = H_l'\tilde{\beta}$, and the models have graphical model structure, then the median
probability model is the best predictive model.

Proof. Because $q_i > 0$, $\tilde{\beta}_i^2 \ge 0$ for each $i$ and $p_i$ (defined in (9.62)) does
not depend on $l$, to minimize $R(M_l)$ among all possible models, it suffices
to minimize $(l_i - p_i)^2$ for each individual $i$ and that is achieved by choosing
$l_i = 1$ if $p_i \ge \frac{1}{2}$ and $l_i = 0$ if $p_i < \frac{1}{2}$, whence $l$ as defined will be the median
probability model. The graphical model structure ensures that this model is
among the class of models under consideration. □
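A small numerical check of Lemma 9.6 and Theorem 9.8 may help fix ideas. The sketch below assumes all subsets of $k = 3$ variables, a diagonal $Q$, and the condition $\tilde{\beta}_l = H_l'\tilde{\beta}$, so that $H_l\tilde{\beta}_l$ is simply $\tilde{\beta}$ with the excluded coordinates set to zero. The posterior model probabilities, the full-model posterior mean `beta_full`, and the diagonal `q` are hypothetical numbers, not taken from any data set.

```python
from itertools import product

k = 3
models = list(product([0, 1], repeat=k))            # all subsets, coded as 0-1 tuples l
# Hypothetical posterior model probabilities, listed in the order of `models`.
weights = [0.02, 0.08, 0.10, 0.20, 0.05, 0.25, 0.10, 0.20]
post = dict(zip(models, weights))

beta_full = [1.5, -0.7, 2.0]   # hypothetical posterior mean under the full model
q = [1.0, 2.0, 0.5]            # hypothetical diagonal elements of Q

# Posterior inclusion probabilities, equation (9.62).
p = [sum(post[m] for m in models if m[i] == 1) for i in range(k)]


def risk_direct(l):
    """R(M_l) from (9.64), using the model-averaged posterior mean as beta-bar."""
    beta_bar = [sum(post[m] * m[i] * beta_full[i] for m in models) for i in range(k)]
    return sum(q[i] * (l[i] * beta_full[i] - beta_bar[i]) ** 2 for i in range(k))


def risk_lemma(l):
    """R(M_l) from the closed form (9.67)."""
    return sum(q[i] * beta_full[i] ** 2 * (l[i] - p[i]) ** 2 for i in range(k))


best = min(models, key=risk_direct)                  # best predictive model
l_star = tuple(1 if p_i >= 0.5 else 0 for p_i in p)  # median probability model
mode = max(post, key=post.get)                       # highest posterior probability model

print(p, best, l_star, mode)
```

With these numbers the two expressions for $R(M_l)$ agree for every model, the minimizer coincides with the median probability model, and both differ from the highest posterior probability model.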


Remark 9.9. The above theorem obviously holds if we consider all submodels,
since this class has graphical model structure, provided the conditions of the
theorem hold. By the same token, the result will hold under the situation
where the models under consideration are nested.


Corollary 9.10. (Barbieri and Berger, 2004) If the conditions of the above
theorem hold, all submodels of the full model are allowed, $\sigma^2$ is known, $X'X$
is diagonal and $\beta_i$'s have $N(\mu_i, \lambda_i\sigma^2)$ distributions and

$$P(M_l) = \prod_{i=1}^{k} (p_i^0)^{l_i} (1 - p_i^0)^{1 - l_i}, \qquad (9.70)$$

where $p_i^0$ is the prior probability that variable $x_i$ is in the model, then the
optimal predictive model is the model with highest posterior probability which
is also the median probability model.
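The key step behind the corollary is that a posterior over all subsets that factorizes across variables has its mode at the median probability model. The sketch below illustrates this with hypothetical posterior inclusion probabilities `p`; the function name is also only illustrative.

```python
from itertools import product
from math import prod


def product_posterior(p):
    """Posterior over all subsets of the form prod_i p_i^{l_i} (1 - p_i)^{1 - l_i}."""
    return {l: prod(p_i if l_i else 1.0 - p_i for p_i, l_i in zip(p, l))
            for l in product([0, 1], repeat=len(p))}


# Hypothetical posterior inclusion probabilities for k = 4 variables.
p = [0.8, 0.3, 0.55, 0.1]
post = product_posterior(p)

mode = max(post, key=post.get)                        # highest posterior probability model
median = tuple(1 if p_i >= 0.5 else 0 for p_i in p)   # median probability model

# The inclusion probabilities recovered from the joint posterior equal p.
recovered = [sum(prob for l, prob in post.items() if l[i] == 1) for i in range(len(p))]

print(mode, median, recovered)
```

Each factor is maximized separately by taking $l_i = 1$ exactly when $p_i$ exceeds $1 - p_i$, which is why the two models coincide here.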


Proof. Let $\hat{\beta}_i$ be the least squares estimate of $\beta_i$ under the full model. Because
$X'X$ is diagonal, $\hat{\beta}_i$'s are independent and the likelihood under $M_l$ factors as

$$L(M_l) \propto \prod_{i=1}^{k} A_i^{l_i} B_i^{1 - l_i},$$

where $A_i$ depends only on $\beta_i$ and $\hat{\beta}_i$, $B_i$ depends only on $\hat{\beta}_i$ and the constants
of proportionality here and below depend on $Y$ and $\hat{\beta}_i$'s.
Also, the conditional prior distribution of $\beta_i$'s given $M_l$ has a factorization

$$\pi(\beta \mid M_l) = \prod_{i=1}^{k} [N(\mu_i, \lambda_i\sigma^2)]^{l_i} [\delta\{0\}]^{1 - l_i},$$

where $\delta\{0\}$ = degenerate distribution with all mass at zero.
It follows from (9.70) and the above two factorizations that the posterior
probability of $M_l$ has a factorization

$$P(M_l \mid Y) \propto \prod_{i=1}^{k} \left\{ p_i^0 \int A_i\, N(\mu_i, \lambda_i\sigma^2)\, d\beta_i \right\}^{l_i} \left\{ (1 - p_i^0) B_i \right\}^{1 - l_i},$$

which in turn implies that the marginal posterior of including or not including
the $i$th variable is proportional to the two terms respectively in the $i$th factor. This
completes the proof, vide Problem 21. (The integral can be evaluated as in
Chapter 2.) □

We have noted before that if the conditions in Theorem 9.8 are satisfied
and the models are nested, then the best predictive model is the median
probability model. Interestingly even if $Q$ is not necessarily diagonal, the best
predictive model turns out to be the median probability model under some
mild assumptions, in the nested model scenario. Consider
Assumption 1: $Q = \gamma X'X$ for some $\gamma > 0$, i.e., the prediction will be made
at covariates that are similar to the ones already observed in the past.
Assumption 2: $\tilde{\beta}_l = b\hat{\beta}_l$, where $b > 0$, i.e., the posterior means are proportional
to the least squares estimates.


Remark 9.11. Barbieri and Berger (2004) list two situations when the second
assumption will be satisfied. The first is when one uses the reference prior $\pi(\beta_l, \sigma) =
1/\sigma$, whereby the posterior means will be the LSEs. It will also be satisfied
with $b = c/(1 + c)$, if one uses g-type normal priors of Zellner (1986), where
$\pi(\beta_l \mid \sigma) \sim N_{k_l}(0, c\sigma^2(X_l'X_l)^{-1})$ and the prior on $\sigma$ is arbitrary.


Theorem 9.12. For a sequence of nested models for which the above two 
conditions hold, the best predictive model is the median probability model. 


Proof. See Barbieri and Berger (2004). □


Barbieri and Berger (2004, Section 5) present a geometric formulation for
identification of the optimal predictive model. They also establish conditions
under which the median probability model and the maximum posterior
probability model coincide, and show that it is typically not enough to know only
the posterior probabilities of each model to determine the optimal predictive
model.

Till now we have concentrated on some Bayesian approaches to the predic- 
tion problem. It turns out that model selection based on the classical Akaike 
information criterion (AIC) also plays an important role in Bayesian pre- 
diction and estimation for linear models and function estimation. Optimality 
results for AIC in classical statistics are due to Shibata (1981, 1983), Li (1987), 
and Shao (1997). 

The first Bayesian result about AIC is taken from Mukhopadhyay (2000).
Here one has observations $\{y_{ij} : i = 1, \ldots, p,\ j = 1, \ldots, r,\ n = pr\}$ given by

$$y_{ij} = \mu_i + \epsilon_{ij}, \qquad (9.71)$$

where $\epsilon_{ij}$ are i.i.d. $N(0, \sigma^2)$ with $\sigma^2$ known. The models are $M_1 : \mu_i = 0$ for all
$i$ and $M_2 : \eta^2 = \lim_{p \to \infty} \frac{1}{p}\sum_{i=1}^{p} \mu_i^2 > 0$. Under $M_2$, we assume a $N(0, \tau^2 I_p)$
prior on $\mu$ where $\tau^2$ is to be estimated from data using an empirical Bayes
method. It is further assumed that $p \to \infty$ as $n \to \infty$. The goal is to predict a
future set of observations $\{z_{ij}\}$ independent of $\{y_{ij}\}$ using the usual prediction
error loss, with the ‘constraint’ that once a model is selected, least squares
estimates have to be used to make the predictions. Theorem 9.13 shows that
the constrained empirical Bayes rule is equivalent to AIC asymptotically. A
weaker result is given as Problem 17.

Theorem 9.13. (Mukhopadhyay, 2000) Suppose $M_2$ is true, then asymptotically
the constrained empirical Bayes rule and AIC select the same model.
Under $M_1$, AIC and the constrained empirical Bayes rule choose $M_1$ with
probability tending to 1. Also under $M_1$, the constrained empirical Bayes rule
chooses $M_1$ whenever AIC does so.
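The connection between AIC and the constrained empirical Bayes rule in the setup (9.71) can be sketched numerically. With $\sigma^2$ known, $M_1$ has no free parameters and $M_2$ has $p$, so AIC prefers $M_2$ when $r\sum_i \bar{y}_i^2/\sigma^2 > 2p$. For the empirical Bayes rule, the sketch below assumes (this is an assumption made here for illustration, in the spirit of Problem 17) that $\tau^2$ is estimated by the moment estimate $\hat{\tau}^2 = \max(0, \frac{1}{p}\sum_i \bar{y}_i^2 - \sigma^2/r)$ and that $M_2$ is preferred when the estimated risk $\hat{\tau}^2$ of predicting with zero exceeds the risk $\sigma^2/r$ of predicting with the group means. All function names and simulation settings are hypothetical.

```python
import random


def aic_selects_m2(ybar, r, sigma2):
    """AIC(M1) - AIC(M2) = r * sum(ybar_i^2) / sigma2 - 2p; M2 is chosen when positive."""
    p = len(ybar)
    return r * sum(y * y for y in ybar) / sigma2 > 2 * p


def eb_selects_m2(ybar, r, sigma2):
    """Constrained EB rule with a moment estimate of tau^2 (an assumption of this sketch)."""
    p = len(ybar)
    tau2_hat = max(0.0, sum(y * y for y in ybar) / p - sigma2 / r)
    # Per-coordinate predictive risk: tau^2 when predicting 0, sigma2 / r when using ybar_i.
    return tau2_hat > sigma2 / r


def simulate_group_means(mu, r, sigma2, rng):
    """Group means ybar_i ~ N(mu_i, sigma2 / r), the sufficient statistics for (9.71)."""
    sd = (sigma2 / r) ** 0.5
    return [rng.gauss(m, sd) for m in mu]


rng = random.Random(0)
p, r, sigma2 = 200, 5, 1.0

ybar_m1 = simulate_group_means([0.0] * p, r, sigma2, rng)                          # M1 true
ybar_m2 = simulate_group_means([rng.gauss(0.0, 1.0) for _ in range(p)], r, sigma2, rng)  # tau^2 = 1

choices = [(aic_selects_m2(y, r, sigma2), eb_selects_m2(y, r, sigma2))
           for y in (ybar_m1, ybar_m2)]
print(choices)
```

Under these assumptions the two rules reduce to the same inequality, so they agree on every data set up to rounding; the simulation merely illustrates that both choose $M_1$ when the means are zero and $M_2$ when they are not.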


The result is extended to general nested problems in Mukhopadhyay and 
Ghosh (2004a). It is however also shown in the above reference that if one 
uses Bayes estimates instead of least squares estimates, then the unconstrained 
Bayes rule does better than AIC asymptotically. The performance of AIC in 
the PEB setup of George and Foster (2000) is studied in Mukhopadhyay and 
Ghosh (2004a). 

As one would expect from this, AIC also performs well in nonparametric 
regression which can be formulated as an infinite dimensional linear prob- 
lem. It is shown in Chakrabarti and Ghosh (2005b) that AIC attains the 
optimal rate of convergence in an asymptotically equivalent problem and is 
also adaptive in the sense that it makes no assumption about the degree of 
smoothness. Because this result is somewhat technical, we only present some 
numerical results for the problem of nonparametric regression. 

In the nonparametric regression problem 


$$Y_i = f(i/n) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (9.72)$$


one has to estimate the unknown smooth function $f$. In Table 9.3, we consider
$n = 100$ and $f(x)$ = (sin(2πx))*, (cos(πx))*, 7 + cos(2πx), and exp(sin(2πx)),
the loss function $L(f, \hat{f}) = \int_0^1 (f(x) - \hat{f}(x))^2\,dx$, and report the average loss
of modified James-Stein estimator of Cai et al. (2000), AIC, and the ker- 
nel method with Epanechnikov kernel in 50 simulations. To use the first two 
methods, we express f in its (partial sum) Fourier expansion with respect to 
the usual sine-cosine Fourier basis of [0,1] and then estimate the Fourier coef- 
ficients by the regression coefficients. Some simple but basic insight about the 
AIC may be obtained from Problems 15-17. It is also worth remembering that 
AIC was expected by Akaike to perform well in high-dimensional estimation 
or prediction problem when the true model is too complex to be in the model 
space. 
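The way AIC is used with a partial-sum Fourier expansion can be sketched as follows. The sketch makes several assumptions that are not taken from the text: the design points are $x_i = i/n$, the error variance is treated as known so that AIC for the model with $m$ coefficients is $\mathrm{RSS}_m/\sigma^2 + 2m$, the noise level is $\sigma = 0.5$, and at most 15 terms are considered. It uses the test function $7 + \cos(2\pi x)$ and approximates the integrated squared error loss by an average over the design points. All function names are hypothetical.

```python
import math
import random


def basis(a, x):
    """Sine-cosine basis on [0, 1]: a = 0 constant, a = 2k-1 cosine, a = 2k sine."""
    if a == 0:
        return 1.0
    k = (a + 1) // 2
    trig = math.cos if a % 2 == 1 else math.sin
    return math.sqrt(2.0) * trig(2.0 * math.pi * k * x)


def fit_by_aic(y, sigma2, max_terms):
    """Choose the number of leading Fourier terms by AIC (sigma2 assumed known)."""
    n = len(y)
    x = [(i + 1) / n for i in range(n)]
    # On the equispaced grid the basis is orthonormal for low frequencies,
    # so the least squares coefficients are simple averages.
    coef = [sum(y[i] * basis(a, x[i]) for i in range(n)) / n for a in range(max_terms)]
    best = None
    for m in range(1, max_terms + 1):
        fitted = [sum(coef[a] * basis(a, xi) for a in range(m)) for xi in x]
        rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
        aic = rss / sigma2 + 2 * m
        if best is None or aic < best[0]:
            best = (aic, m, fitted)
    return best[1], best[2]


rng = random.Random(1)
n, sigma = 100, 0.5
truth = [7.0 + math.cos(2.0 * math.pi * (i + 1) / n) for i in range(n)]
y = [t + rng.gauss(0.0, sigma) for t in truth]

m_hat, fitted = fit_by_aic(y, sigma ** 2, max_terms=15)
loss = sum((t - fh) ** 2 for t, fh in zip(truth, fitted)) / n
print(m_hat, loss)
```

Because the candidate models are nested partial sums, the coefficients of the smaller models are the leading coefficients of the larger ones, which keeps the search over $m$ cheap.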


9.9 Discussion 


Bayesian model selection is passing through a stage of rapid growth, especially 
in the context of bioinformatics and variable selection. The two previous sec- 
tions provide an overview of some of the literature. See also the review by 
Ghosh and Samanta (2001). For a very clear and systematic approach to dif- 
ferent aspects of model selection, see Bernardo and Smith (1994). 

Table 9.3. Comparison of Simulation Performance of Various Estimation Methods
in Nonparametric Regression

Model selection based on AIC is used in many real-life problems by Burnham
and Anderson (2002). However, its use for testing problems with 0-1
loss is questionable vide Problem 16. A very promising new model selection
criterion due to Spiegelhalter et al. (2002) may also be interpreted as a
generalization of AIC, see, e.g., Chakrabarti and Ghosh (2005a). In the latter
paper, GBIC is also interpreted from the information theoretic point of view
of Rissanen (1987).

We believe the Bayesian approach provides a unified approach to model 
selection and helps us see classical rules like BIC and AIC as still important 
but by no means the last word in any sense. We end this section with two 
final comments. 

One important application of model selection is to examine model fit. 
Gelfand and Ghosh (1998) (see also Gelfand and Dey (1994)) use leave-k- 
out cross-validation to compare each collection of k data points and their 
predictive distribution based on the remaining observations. Based on the 
predictive distributions, one may calculate predicted values and some measure 
of deviation from the k observations that are left out. An average of the 
deviation over all sets of k left out observations provides some idea of goodness 
of fit. Gelfand and Ghosh (1998) use these for model selection. Presumably, the 
average distance for a model can be used for model check also. An interesting 
work of this kind is Bhattacharya (2005). 

Another important problem is computation of the Bayes factor. Gelfand
and Dey (1994) and Chib (1995) show how one can use MCMC calculations
by relating the marginal likelihood of data to the posterior via $P(y) =
L(\theta \mid y)P(\theta)/P(\theta \mid y)$. Other relevant papers are Carlin and Chib (1995), Chib
and Greenberg (1998), and Basu and Chib (2003). There are interesting
suggestions also in Gelman et al. (1995).
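The identity $P(y) = L(\theta \mid y)P(\theta)/P(\theta \mid y)$ holds at every value of $\theta$, which is what makes it usable once the posterior density can be evaluated or estimated at a single point. The sketch below checks it in a toy conjugate model where everything is available in closed form: $y \mid \theta \sim N(\theta, 1)$ and $\theta \sim N(0, \tau^2)$, so that $y \sim N(0, 1 + \tau^2)$ marginally. The model, the data value, and the function names are illustrative assumptions; in realistic problems the posterior ordinate would be estimated from MCMC output.

```python
from math import log, pi


def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * log(2.0 * pi * var) - (x - mean) ** 2 / (2.0 * var)


def log_marginal_via_identity(y, theta, tau2):
    """log P(y) = log L(theta | y) + log P(theta) - log P(theta | y), for any theta."""
    post_var = tau2 / (1.0 + tau2)       # posterior variance of theta given y
    post_mean = post_var * y             # posterior mean of theta given y
    return (log_normal_pdf(y, theta, 1.0)
            + log_normal_pdf(theta, 0.0, tau2)
            - log_normal_pdf(theta, post_mean, post_var))


def log_marginal_direct(y, tau2):
    """Closed-form log marginal: y ~ N(0, 1 + tau2)."""
    return log_normal_pdf(y, 0.0, 1.0 + tau2)


y = 1.3
# Log Bayes factor of the prior with tau^2 = 1 against the prior with tau^2 = 10,
# computed through the identity at an arbitrary evaluation point theta = 0.4.
log_bf = log_marginal_via_identity(y, 0.4, 1.0) - log_marginal_via_identity(y, 0.4, 10.0)
print(log_bf)
```

In practice a high-density point such as the posterior mean is the usual choice of evaluation point, since the posterior ordinate is estimated most accurately there.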


9.10 Exercises 


1. Show that $\pi(\eta_2 \mid X)$ is an improper density if we take $\pi(\eta_1, \eta_2) = 1/\eta_2$ in
Example 9.3.

2. Justify (9.2) and (9.3).

3. Complete the details to implement Gibbs sampling and E-M algorithm in
Example 9.3 when $\mu$ and $\sigma^2$ are unknown. Take $\pi(\eta_1, \sigma^2, \eta_2) = 1/\sigma^2$.

4. Let $X_i$'s be independent with density $f(x \mid \theta_i)$, $i = 1, 2, \ldots, p$, $\theta_i \in R$.
Consider the problem of estimating $\theta = (\theta_1, \ldots, \theta_p)'$ with loss function

$$L(\theta, a) = \sum_{i=1}^{p} L(\theta_i, a_i) = \sum_{i=1}^{p} (\theta_i - a_i)^2, \quad \theta, a \in R^p,$$

i.e., the total loss is the sum of the losses in estimating $\theta_i$ by $a_i$. An
estimator for $\theta$ is the vector $(T_1(X), T_2(X), \ldots, T_p(X))$. We call this a
compound decision problem with $p$ components.

(a) Suppose $\sup_\theta f(x \mid \theta) = f(x \mid T(x))$, i.e., $T(x)$ is the MLE (of $\theta_i$ in
$f(x \mid \theta_i)$). Show that $(T(X_1), T(X_2), \ldots, T(X_p))$ is the MLE of $\theta$.

(b) Suppose $T(X)$ (not necessarily the $T(X)$ of (a)) satisfies the sufficient
condition for a minimax estimate given at the end of Section 1.5.
Is $(T(X_1), T(X_2), \ldots, T(X_p))$ minimax for $\theta$ in the compound decision
problem?

(c) Suppose $T(X)$ is the Bayes estimate with respect to squared error loss
for estimating $\theta$ of $f(x \mid \theta)$. Is $(T(X_1), \ldots, T(X_p))$ a Bayes estimate for $\theta$?

(d) Suppose $T = (T_1(X_1), \ldots, T_p(X_p))$ and $T_j(X_j)$ is admissible in the
$j$th component decision problem. Is $T$ admissible?


5. Verify the claim of the best unbiased predictor (9.17).

6. Given the hierarchical prior of Section 9.3 for Morris’s regression setup,
calculate the posterior and the Bayes estimate as explicitly as possible.
Find the full conditionals of the posterior distribution in order to implement
MCMC.

7. Prove the claims of superiority made in Section 9.4 for the James-Stein-Lindley
estimate and the James-Stein positive part estimate using Stein’s
identity.

8. Under the setup of Section 9.3, show that the PEB risk of $\hat{\theta}_i$ is smaller
than the PEB risk of $Y_i$.

9. Refer to Sections 9.3 and 9.4. Compare the PEB risk of $\hat{\theta}_i$ and Stein’s
frequentist risk of $\hat{\theta}$ and show that the two risks are of the same form but
one has $E(\hat{B})$ and the other $E_\theta(\hat{B})$. (Hint: See equations (1.17) and (1.18)
of Morris (1983)).

10. Consider the setup of Section 9.3. Show that $\hat{B}$ is the best unbiased
estimate of $B$.

11. (Disease mapping) (See Section 10.1 for more details on the setup.) Suppose
that the area to be mapped is divided into $N$ regions. Let $O_i$ and $E_i$
be respectively the observed and expected number of cases of a disease in
the $i$th region, $i = 1, 2, \ldots, N$. The unknown parameters of interest are $\theta_i$,
the relative risk in the $i$th region, $i = 1, 2, \ldots, N$. The traditional model
for $O_i$ is the Poisson model, which states that given $(\theta_1, \ldots, \theta_N)$, $O_i$’s are
independent and

$$O_i \mid \theta_i \sim \text{Poisson}(E_i\theta_i).$$

Let $\theta_1, \theta_2, \ldots, \theta_N$ be i.i.d. $\sim$ Gamma$(a, b)$. Find the PEB estimates of
$\theta_1, \theta_2, \ldots, \theta_N$. In Section 10.1, we will consider hierarchical Bayes analysis
for this problem.

12. Let $Y_i$ be independent $N(\theta_i, V)$, $i = 1, 2, \ldots, p$. Stein’s heuristics (Section 9.4)
shows $\|Y\|^2$ is too large in a frequentist sense. Verify by a similar argument
that if $\theta_i$ are i.i.d. uniform on $R$ then $\|Y\|^2$ is too small in an improper
Bayesian sense, i.e., there is extreme divergence between frequentist probability
and naive objective Bayes probability in a high-dimensional case.

13. (Berger (1985a, p. 542)) Consider a multiparameter exponential family
$f(x \mid \theta) = c(\theta)\exp(\theta' T(x))h(x)$, where $x$ and $\theta$ are vectors of the same
dimension. Assuming Stein’s loss, show that (under suitable conditions) the
Bayes estimate can be written as gradient$(\log m(x))$ $-$ gradient$(\log h(x))$
where $m(x)$ is the marginal density of $x$ obtained by integrating out $\theta$.

14. Simulate data according to the model in Example 9.3, Section 9.1.

(a) Examine how well the model can be checked from the data $X_{ij}$.

(b) Suppose one uses the empirical distribution of $\bar{X}_i$’s as a surrogate
prior for $\mu_i$’s. Compare critically the Bayes estimate of $\mu$ for this prior
with the PEB estimate.

15. (Stone’s problem) Let

$Y_{ij} = \mu + \mu_i + \epsilon_{ij}$, $\epsilon_{ij} \sim N(0, \sigma^2)$, $i = 1, 2, \ldots, p$, $j = 1, 2, \ldots, r$, $n = pr$ with
$\sigma^2$ assumed known or estimated by $S^2 = \sum_{i=1}^{p}\sum_{j=1}^{r} (Y_{ij} - \bar{Y}_i)^2/p(r - 1)$.
The two models are

$$M_1 : \mu_i = 0\ \forall i \quad \text{and} \quad M_2 : \mu \in R^p.$$

Suppose $n \to \infty$, $p \log n/n \to \infty$ and $\sum_{i=1}^{p} (\mu_i - \bar{\mu})^2/(p - 1) \to \tau^2 > 0$.

(a) Show that even though $M_2$ is true, BIC will select $M_1$ with probability
tending to 1. Also show that AIC will choose the right model $M_2$ with
probability tending to one.

(b) As a Bayesian how important do you think is this notion of consistency?

(c) Explore the relation between AIC and selection of model based on
estimation of residual sum of squares by leave-one-out cross validation.

16. Consider an extremely simple testing problem. $X \sim N(\mu, 1)$. You have to
test $H_0 : \mu = 0$ versus $H_1 : \mu \ne 0$. Is AIC appropriate for this? Compare
AIC, BIC, and the usual likelihood ratio test, keeping in mind the conflict
between P-values and posterior probability of the sharp null hypothesis.

17. Consider two nested models and an empirical Bayes model selection rule
with the evaluation based on the more complex model. Though you know
the more complex model is true, you may be better off predicting with
the simpler model.

Let $Y_{ij} = \mu_i + \epsilon_{ij}$, $\epsilon_{ij}$ i.i.d. $N(0, \sigma^2)$, $i = 1, \ldots, p$, $j = 1, \ldots, r$, with
known $\sigma^2$. The models are

$$M_1 : \mu = 0$$
$$M_2 : \mu \in R^p,\ \mu \sim N_p(0, \tau^2 I_p),\ \tau^2 > 0.$$

(a) Assume that in PEB evaluation under $M_2$ you estimate $\tau^2$ by the
moment estimate:

Show that with PEB evaluation of risk under $M_2$ and $M_1$, $\bar{Y}$ is preferred if and
only if AIC selects $M_2$.
(b) Why is it desirable to have large $p$ in this problem?
(c) How will you try to justify in an intuitive way occasional choice of the
simple but false model?
(d) Use (a) to motivate how the penalty coefficient 2 arises in AIC.
(This problem is based on a result in Mukhopadhyay (2001)).

18. Burnham and Anderson (2002) generated data to mimic a real-life experiment
of Stromberg et al. (1998). Select a suitable model from among
the 9 models considered by Ghosh and Samanta (2001). The main issue is
computation of the integrated likelihood under each model. You can try
Laplace approximation, the method based on MCMC suggested at the
end of Section 9.9, and importance sampling. All methods are difficult,
but they give very close answers in this problem. The data and the models
can be obtained from the Web page
http://www.isical.ac.in/~tapas/book

19. Let $X_i \sim N(\mu, 1)$, $i = 1, \ldots, n$ and $\mu \sim N(\eta_1, \eta_2)$. Find the PEB estimate
of $\eta_1$ and $\eta_2$ and examine its implications for the inadequacy of the PEB
approach in low-dimensional problems.

20. Consider NPEB multiple testing (Section 9.6.1) with known $\pi_1$ and an
estimate $\hat{f}$ of $(1 - \pi_1) f_0 + \pi_1 f_1$. Suppose for each $i$, you reject $H_{0i} : \mu_i = 0$
if

$$f_0(x_i) < \hat{f}(x_i)\alpha, \quad \text{where } 0 < \alpha < 1.$$

Examine whether this test provides any control on the (frequentist) FDR.
Define a Bayesian FDR and examine if, for small $\pi_1$, this is also controlled
by the test. Suggest a test that would make the Bayesian FDR
approximately equal to $\alpha$. (The idea of controlling a Bayesian FDR is due
to Storey (2003). The simple rules in this problem are due to Bogdan,
Ghosh, and Tokdar (personal communication).)

21. For all subsets variable selection models show that the posterior median
model and the posterior mode model are the same if

$$P(M_l \mid X) = \prod_{i=1}^{p} p_i^{l_i} (1 - p_i)^{1 - l_i},$$

where $l_i = 1$ if the $i$th variable is included in $M_l$ and $l_i = 0$ otherwise.

Some Applications


The popularity of Bayesian methods in recent times is mainly due to their suc- 
cessful applications to complex high-dimensional real-life problems in diverse 
areas such as epidemiology, microarrays, pattern recognition, signal process- 
ing, and survival analysis. This chapter presents a few such applications to- 
gether with the required methodology. We describe the method without going 
into the details of the critical issues involved, for which references are given. 
This is followed by an application involving real or simulated data. 

We begin with a hierarchical Bayesian modeling of spatial data in Sec- 
tion 10.1. This is in the context of disease mapping, an area of epidemio- 
logical interest. The next two sections, 10.2 and 10.3, present nonparametric 
estimation of regression function using wavelets and Dirichlet multinomial al- 
location. They may also be treated as applications involving Bayesian data 
smoothing. For several recent advances in Bayesian nonparametrics, see Dey 
et al. (1998) and Ghosh and Ramamoorthi (2003). 


10.1 Disease Mapping 


Our first application is from the area of epidemiology and involves hierarchical 
Bayesian spatial modeling. Disease mapping provides a geographical distribu- 
tion of a disease displaying some index such as the relative risk of the disease 
in each subregion of the area to be mapped. Suppose that the area to be 
mapped is divided into $N$ regions. Let $O_i$ and $E_i$ be respectively the observed
and expected number of cases of a disease in the $i$th region, $i = 1, 2, \ldots, N$.
The unknown parameters of interest are $\theta_i$, the relative risk in the $i$th region,
$i = 1, 2, \ldots, N$. Here $E_i$ is a simple-minded expectation assuming all
regions have the same disease rate (at least after adjustment for age), vide
Banerjee et al. (2004, p. 158). The relative risk $\theta_i$ is the regional effect in a
multiplicative model of expected number of cases: $E(O_i) = E_i\theta_i$. If $\theta_i = 1$, we
have $E(O_i) = E_i$. The objective is to make inference about $\theta_i$’s across regions.
Among other things, this helps epidemiologists and public health professionals
to identify regions or clusters of regions having high relative risks and hence
needing attention and also to identify covariates causing high relative risk.
The traditional model for $O_i$ is the Poisson model, which states that given
$(\theta_1, \ldots, \theta_N)$, $O_i$’s are independent and

$$O_i \mid \theta_i \sim \text{Poisson}(E_i\theta_i). \qquad (10.1)$$

Under this model $E_i$’s are assumed fixed. The classical maximum likelihood
estimate of $\theta_i$ is $\hat{\theta}_i = O_i/E_i$, known as the standardized mortality ratio (SMR)
for region $i$ and $\mathrm{Var}(\hat{\theta}_i) = \theta_i/E_i$, which may be estimated as $\hat{\theta}_i/E_i$. However,
it was noted in Chapter 9 that the classical estimates may not be appropriate
here for simultaneous estimation of the parameters $\theta_1, \theta_2, \ldots, \theta_N$.
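The SMR and its estimated variance are simple to compute, and a small sketch makes the instability of the classical estimates visible. The counts below are hypothetical and are not the Scottish lip cancer data; the function names are likewise only illustrative.

```python
def smr(observed, expected):
    """Standardized mortality ratios: theta_hat_i = O_i / E_i."""
    return [o / e for o, e in zip(observed, expected)]


def smr_variance(observed, expected):
    """Estimated variances: theta_hat_i / E_i = O_i / E_i^2."""
    return [o / (e * e) for o, e in zip(observed, expected)]


# Hypothetical observed and expected counts for three regions.
observed = [9, 2, 0]
expected = [1.5, 4.0, 0.5]

theta_hat = smr(observed, expected)
var_hat = smr_variance(observed, expected)
print(theta_hat, var_hat)
```

Regions with small expected counts give extreme ratios with large estimated variances, or a ratio of zero with an estimated variance of zero, which is one reason why smoothing across regions is attractive.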

As mentioned in Chapter 9, because of the assumption of exchangeability
of $\theta_1, \ldots, \theta_N$, there is a natural Bayesian solution to the problem. A Bayesian
modeling involves specification of prior distribution of $(\theta_1, \ldots, \theta_N)$. Clayton
and Kaldor (1987) followed the empirical Bayes approach using a model that
assumes

$$\theta_1, \theta_2, \ldots, \theta_N \ \text{i.i.d.} \sim \text{Gamma}(a, b) \qquad (10.2)$$

and estimated the hyperparameters $a$ and $b$ from the marginal density of
$\{O_i\}$ given $a, b$ (see Section 9.2). Here we present a full Bayesian approach
adopting a prior model that allows for spatial correlation among the $\theta_i$’s. A
natural extension of (10.2) could be a multivariate Gamma distribution for
$(\theta_1, \ldots, \theta_N)$. We, however, assume a multivariate normal distribution for the
log-relative risks $\log\theta_i$, $i = 1, \ldots, N$. The model may also be extended to
allow for explanatory covariates $x_i$ which may affect the relative risk. Thus
we consider the following hierarchical Bayesian model

$$O_i \mid \theta_i \ \text{are independent} \sim \text{Poisson}(E_i\theta_i), \qquad (10.3)$$

where $\log\theta_i = x_i'\beta + \phi_i$, $i = 1, \ldots, N$.
The usual prior for $\phi = (\phi_1, \ldots, \phi_N)$ is given by the conditionally autoregressive
(CAR) model (Besag, 1974), which is briefly described below. For
details see, e.g., Besag (1974) and Banerjee et al. (2004, pp. 79-83, 163, 164).
Suppose the full conditionals are specified as

$$\phi_i \mid \phi_j, j \ne i \sim N\Big(\sum_{j \ne i} a_{ij}\phi_j,\ \sigma_i^2\Big), \quad i = 1, 2, \ldots, N. \qquad (10.4)$$
These will lead to a joint distribution having density proportional to

$$\exp\left\{-\tfrac{1}{2}\phi' D^{-1}(I - A)\phi\right\}, \qquad (10.5)$$

where $D = \mathrm{Diag}(\sigma_1^2, \ldots, \sigma_N^2)$ and $A = (a_{ij})_{N \times N}$. We look for a model that allows
for spatial correlation and so consider a model where correlation depends
on geographical proximity. A proximity matrix $W = (w_{ij})$ is an $N \times N$ matrix
where $w_{ij}$ spatially connects regions $i$ and $j$ in some manner. We consider here
binary choices. We set $w_{ii} = 0$ for all $i$, and for $i \ne j$, $w_{ij} = 1$ if $i$ is a neighbor
of $j$, i.e., $i$ and $j$ share some common boundary and $w_{ij} = 0$ otherwise. Also,
$w_{ij}$’s in each row may be standardized as $\tilde{w}_{ij} = w_{ij}/w_{i+}$ where $w_{i+} = \sum_j w_{ij}$
is the number of neighbors of region $i$. Returning to our model (10.5), we now
set $a_{ij} = \alpha w_{ij}/w_{i+}$ and $\sigma_i^2 = \lambda/w_{i+}$. Then (10.5) becomes

$$\exp\left\{-\tfrac{1}{2\lambda}\phi'(D_w - \alpha W)\phi\right\},$$

where $D_w = \mathrm{Diag}(w_{1+}, w_{2+}, \ldots, w_{N+})$. This also ensures that $D^{-1}(I - A) =
\frac{1}{\lambda}(D_w - \alpha W)$ is symmetric.
Thus the prior for $\phi$ is multivariate normal

$$\phi \sim N(0, \Sigma) \ \text{with} \ \Sigma = \lambda(D_w - \alpha W)^{-1}. \qquad (10.6)$$

We take $0 < \alpha < 1$, which ensures propriety of the prior and positive spatial
correlation; only the values of $\alpha$ close to 1 give enough spatial similarity. For
$\alpha = 1$ we have the standard improper CAR model. One may use the improper
CAR prior because it is known that the posterior will typically emerge as
proper. For this and other related issues, see Banerjee et al. (2004).
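The structure of the CAR prior precision matrix $\frac{1}{\lambda}(D_w - \alpha W)$ can be examined on a toy map. The sketch below uses a hypothetical map of four regions arranged in a row, so that each region neighbors only the adjacent ones; the function names are illustrative. For a symmetric matrix with positive diagonal, strict diagonal dominance is a sufficient condition for positive definiteness, and this holds for $0 < \alpha < 1$ whenever every region has at least one neighbor. For $\alpha = 1$ every row sums to zero, so the matrix is singular and the prior is improper.

```python
def car_precision(W, alpha, lam):
    """Precision matrix (D_w - alpha * W) / lam for a binary proximity matrix W."""
    n = len(W)
    w_plus = [sum(row) for row in W]      # number of neighbors of each region
    return [[((w_plus[i] if i == j else 0) - alpha * W[i][j]) / lam
             for j in range(n)] for i in range(n)]


def is_symmetric(M):
    n = len(M)
    return all(M[i][j] == M[j][i] for i in range(n) for j in range(n))


def is_strictly_diagonally_dominant(M):
    n = len(M)
    return all(abs(M[i][i]) > sum(abs(M[i][j]) for j in range(n) if j != i)
               for i in range(n))


# Hypothetical map: four regions in a row, neighbors share a boundary.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

proper = car_precision(W, alpha=0.9, lam=1.0)
improper = car_precision(W, alpha=1.0, lam=1.0)
row_sums_improper = [sum(row) for row in improper]
print(is_symmetric(proper), is_strictly_diagonally_dominant(proper), row_sums_improper)
```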

Having specified priors for all the unknown parameters including the spatial
variance parameter $\lambda$ and propriety parameter $\alpha$ $(0 < \alpha < 1)$, one can
now do Bayesian analysis using MCMC techniques. We illustrate through an
example.


Example 10.1. Table 10.1 presents data from Clayton and Kaldor (1987) on 
observed (O;) and expected (E;) cases of lip cancer during the period 1975- 
1980 for N = 56 counties of Scotland. Also available are x;, values of a co- 
variate, the percentage of the population engaged in agriculture, fishing, and 
forestry (AFF), for the 56 counties. The log-relative risk is modeled as 


where the prior for $(\phi_1, \ldots, \phi_N)$ is as specified in (10.6). We use vague priors
for $\beta_0$ and $\beta_1$ and a prior having high concentration near 1 for the parameter
$\alpha$. The data may be analyzed using WinBUGS. A WinBUGS code for this
example is put in the web page of Samanta. A part of the results, namely the Bayes
estimates $\hat{\theta}_i$ of the relative risks for the 56 counties, is presented in Table
10.1. The $\hat{\theta}_i$’s are smoothed by pooling the neighboring values in an automatic
adaptive way as suggested in Chapter 9. The estimates of $\beta_0$ and $\beta_1$ are
obtained as $\hat{\beta}_0 = -0.2923$ and $\hat{\beta}_1 = 0.3748$ with estimates of posterior s.d.
equal to 0.3426 and 0.1325, respectively.

Table 10.1. Lip Cancer Incidence in Scotland by County: Observed Numbers ($O_i$),
Expected Numbers ($E_i$), Values of the Covariate AFF ($x_i$), and Bayes Estimates of
the Relative Risk ($\hat{\theta}_i$).


10.2 Bayesian Nonparametric Regression Using Wavelets


Let us recall the nonparametric regression problem that was stated in Exam- 
ple 6.1. In this problem, it is of interest to fit a general regression function to 
a set of observations. It is assumed that the observations arise from a real- 
valued regression function defined on an interval on the real line. Specifically, 
we have 

$$y_i = g(x_i) + \epsilon_i, \quad i = 1, \ldots, n, \ \text{and} \ x_i \in T, \qquad (10.8)$$

where $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$ errors with unknown error variance $\sigma^2$, and $g$ is a
function defined on some interval $T \subset \mathbb{R}^1$.

It can be immediately noted that a Bayesian solution to this problem
involves specifying a prior distribution on a large class of regression functions. 
In general, this is a rather difficult task. A simple approach that has been 
successful is to decompose the regression function g into a linear combination 
of a set of basis functions and to specify a prior distribution on the regression 
coefficients. In our discussion here, we use the (orthonormal) wavelet basis. 
We provide a very brief non-technical overview of wavelets including multi- 
resolution analysis (MRA) here, but for a complete and thorough discussion 
refer to Ogden (1997), Daubechies (1992), Hernandez and Weiss (1996), Müller
and Vidakovic (1999), and Vidakovic (1999). 


10.2.1 A Brief Overview of Wavelets 


Consider the function 


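$$\psi(x) = \begin{cases} 1, & 0 \le x < 1/2; \\ -1, & 1/2 \le x < 1; \\ 0, & \text{otherwise}, \end{cases} \qquad (10.9)$$

which is known as the Haar wavelet, simplest of the wavelets. Note that its
dyadic dilations along with integer translations, namely,

$$\psi_{j,k}(x) = 2^{j/2}\psi(2^j x - k), \quad j, k \in \mathbb{Z}, \qquad (10.10)$$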


provide a complete orthonormal system for L? (R). This says that any f € 
L? (R) can be approximated arbitrarily well using step functions that are 
simply linear combinations of wavelets Y; (£). What is more interesting and 
important is how a finer approximation for f can be written as an orthogonal 
sum of a coarser approximation and a detail function. In other words, for 
74€ Z, let 


V;= { f E L?(R): f is piecewise constant on intervals 
k27, (k +1)275), k € zh. (10.11) 
Now suppose P/f is the projection of f € L?(R) onto V;. Then note that 


Pfarr gig 
SPFS < Sak Vak (10.12) 


keZ 
with g/—! being the detail function as shown, so that 
Vj = Vj-1 © Wy, (10.13) 


where $W_j = \mathrm{span}\{\psi_{j,k}, k \in Z\}$. Also, corresponding with the ‘mother’ wavelet 
$\psi$ (Haar wavelet in this case), there is a father wavelet or scaling function 
$\phi = I_{[0,1)}$, such that $V_j = \mathrm{span}\{\phi_{j,k}, k \in Z\}$, where $\phi_{j,k}$ is the dilation and 
translation of $\phi$ similar to the definition (10.10), i.e., 

$$\phi_{j,k}(x) = 2^{j/2}\phi(2^j x - k), \quad j, k \in Z. \tag{10.14}$$
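The orthonormality of the Haar system can be checked numerically on a few members. The following is only a sketch: the function names (`haar_mother`, `psi`, `inner_product`) and the particular index pairs are illustrative choices, and the half-open intervals follow the usual convention for the Haar wavelet. Because every function involved is piecewise constant on dyadic intervals, a midpoint rule on a fine dyadic grid reproduces the integrals up to rounding.

```python
def haar_mother(x):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0


def psi(j, k, x):
    """Dilated and translated wavelet psi_{j,k}(x) = 2^{j/2} psi(2^j x - k)."""
    return 2.0 ** (j / 2.0) * haar_mother(2.0 ** j * x - k)


def inner_product(f, g, lo=-4.0, hi=4.0, cells_per_unit=64):
    """Midpoint-rule approximation of the L^2 inner product of f and g on [lo, hi]."""
    n = int((hi - lo) * cells_per_unit)
    h = 1.0 / cells_per_unit
    return sum(f(lo + (i + 0.5) * h) * g(lo + (i + 0.5) * h) for i in range(n)) * h


# A few (j, k) pairs whose supports all lie inside [-4, 4].
pairs = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 3), (-1, 0)]

# Gram matrix of inner products; it should be (numerically) the identity.
gram = {
    (a, b): inner_product(lambda x: psi(*a, x), lambda x: psi(*b, x))
    for a in pairs
    for b in pairs
}
```

Each diagonal entry of `gram` is the squared norm of one wavelet, and each off-diagonal entry is an inner product between two different wavelets.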
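In fact, the sequence of subspaces $\{V_j\}$ has the following properties: 

1. $\cdots \subset V_{-2} \subset V_{-1} \subset V_0 \subset V_1 \subset V_2 \subset \cdots$. 
2. $\cap_{j \in Z} V_j = \{0\}$, $\overline{\cup_{j \in Z} V_j} = L^2(R)$. 
3. $f \in V_j$ iff $f(2\,\cdot) \in V_{j+1}$. 
4. $f \in V_0$ implies $f(\cdot - k) \in V_0$ for all $k \in Z$. 
5. There exists $\phi \in V_0$ such that $\mathrm{span}\{\phi_{0,k} = \phi(\cdot - k), k \in Z\} = V_0$. 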


Given this $\phi$, the corresponding $\psi$ can be easily derived (see Ogden (1997) 
or Vidakovic (1999)). What is interesting and useful to us is that there exist 
scaling functions $\phi$ with desirable features other than the Haar function. 
Especially important are Daubechies wavelets, which are compactly supported, 
each with a different degree of smoothness. 


Definition: Closed subspaces $\{V_j\}_{j \in Z}$ satisfying properties 1-5 are said to 
form a multi-resolution analysis (MRA) of $L^2(R)$. If $V_j = \mathrm{span}\{\phi_{j,k}, k \in Z\}$ 
form an MRA of $L^2(R)$, then the corresponding $\phi$ is also said to generate this 
MRA. 


In statistical inference, we deal with finite data sets, so wavelets with 
compact support are desirable. Further, the regression functions (or density 
functions) that we need to estimate are expected to have certain degree of 
smoothness. Therefore, the wavelets used here should have some smoothness 
also. The Haar wavelet does have compact support but is not very smooth. In 
the application discussed later, we use wavelets from the family of compactly 
supported smooth wavelets introduced by Daubechies (1992). These, however, 
cannot be expressed in closed form. A sketch of their construction is as follows. 

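Because, from property 5 above of MRA, $\phi \in V_0 \subset V_1$, we have 

$$\phi(x) = \sum_{k \in Z} h_k \phi_{1,k}(x), \tag{10.15}$$

where the ‘filter’ coefficients $h_k$ are given by 

$$h_k = \langle \phi, \phi_{1,k} \rangle = \sqrt{2} \int \phi(x)\phi(2x - k)\,dx. \tag{10.16}$$

For compactly supported wavelets $\phi$, only finitely many $h_k$’s will be non-zero. 
Define the $2\pi$-periodic trigonometric polynomial 

$$m_0(\omega) = \frac{1}{\sqrt{2}} \sum_{k \in Z} h_k e^{-ik\omega} \tag{10.17}$$

associated with $\{h_k\}$. The Fourier transforms of $\phi$ and $\psi$ can be shown to be 
of the form 

$$\hat\phi(\omega) = \frac{1}{\sqrt{2\pi}} \prod_{j=1}^{\infty} m_0(2^{-j}\omega), \tag{10.18}$$

$$\hat\psi(\omega) = -e^{-i\omega/2}\, \overline{m_0\!\left(\frac{\omega}{2} + \pi\right)}\, \hat\phi\!\left(\frac{\omega}{2}\right). \tag{10.19}$$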


Depending on the number of non-zero elements in the filter $\{h_k\}$, wavelets of 
different degrees of smoothness emerge. 
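To make the role of the filter concrete, the sketch below evaluates $m_0$ for two filters: the Haar filter, which follows from (10.16) with $\phi$ the indicator of the unit interval, and the four-tap Daubechies filter, whose coefficients are the standard published values and are not given in the text above. It then checks the usual orthonormality condition $|m_0(\omega)|^2 + |m_0(\omega + \pi)|^2 = 1$, a standard fact from wavelet theory that is assumed here rather than derived. The function names are illustrative.

```python
import cmath
import math


def m0(h, omega):
    """m_0(omega) = 2^{-1/2} * sum_k h_k * exp(-i k omega), with h = [h_0, h_1, ...]."""
    return sum(hk * cmath.exp(-1j * k * omega) for k, hk in enumerate(h)) / math.sqrt(2.0)


# Haar filter: h_0 = h_1 = 1/sqrt(2).
haar_filter = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]

# Four-tap Daubechies filter (standard values, assumed here).
s3 = math.sqrt(3.0)
d4_filter = [
    (1 + s3) / (4 * math.sqrt(2.0)),
    (3 + s3) / (4 * math.sqrt(2.0)),
    (3 - s3) / (4 * math.sqrt(2.0)),
    (1 - s3) / (4 * math.sqrt(2.0)),
]


def orthonormality_defect(h, n_grid=200):
    """Largest deviation of |m0(w)|^2 + |m0(w + pi)|^2 from 1 over a frequency grid."""
    worst = 0.0
    for i in range(n_grid):
        w = 2.0 * math.pi * i / n_grid
        total = abs(m0(h, w)) ** 2 + abs(m0(h, w + math.pi)) ** 2
        worst = max(worst, abs(total - 1.0))
    return worst
```

Both filters have only finitely many non-zero coefficients, which corresponds to compact support of the scaling function; the longer filter is the one associated with the smoother wavelet.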

It is natural to wonder what is special about MRA. Smoothing techniques 
such as linear regression, splines, and Fourier series all try to represent a 
signal in terms of component functions. At the same time, wavelet-based MRA 
studies the detail signals or differences in the approximations made at adjacent 
resolution levels. This way, local changes can be picked up much more easily 
than with other smoothing techniques. 
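The point about local changes can be illustrated with one level of the discrete Haar transform applied to a step signal. This is a sketch with made-up data; `haar_step` and `haar_inverse_step` are illustrative names. The detail coefficients are zero everywhere except at the pair of samples that straddles the jump, and the coarse approximation together with the details reproduces the signal.

```python
import math


def haar_step(signal):
    """One level of the Haar transform: returns (coarse approximation, detail coefficients)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2.0)
    half = len(signal) // 2
    approx = [(signal[2 * i] + signal[2 * i + 1]) / root2 for i in range(half)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) / root2 for i in range(half)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Rebuild the finer-level signal from a coarse approximation and its details."""
    root2 = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / root2, (a - d) / root2])
    return out


# Step signal: the jump sits between samples 4 and 5, inside the third pair.
signal = [0.0] * 5 + [1.0] * 11
approx, detail = haar_step(signal)
nonzero_positions = [i for i, d in enumerate(detail) if abs(d) > 1e-12]
reconstructed = haar_inverse_step(approx, detail)
```

Here `nonzero_positions` identifies where the signal changes, which is the sense in which the detail signal picks up local features.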

With this short introduction to wavelets, we return to the nonparametric 
regression problem in (10.8). Much of the following discussion closely follows 
Angers and Delampady (2001). We begin with a compactly supported wavelet 
function $\psi \in C^s$, the set of real-valued functions with continuous derivatives 
up to order $s$. We note that then $g$ has the wavelet decomposition 

$$g(x) = \sum_{|k| \le K_0} \alpha_k \phi_k(x) + \sum_{j \ge 0} \sum_{|k| \le K_j} \beta_{jk} \psi_{j,k}(x) \tag{10.20}$$

with 

$$\phi_k(x) = \phi(x - k), \quad \text{and} \quad \psi_{j,k}(x) = 2^{j/2}\psi(2^j x - k),$$

where $K_j$ is such that $\phi_k(x)$ and $\psi_{j,k}(x)$ vanish on $T$ whenever $|k| > K_j$, and 
$\phi$ is the scaling function (‘father wavelet’) corresponding with the ‘mother 
wavelet’ $\psi$. Such $K_j$’s exist (and are finite) because the wavelet function that 
we have chosen has compact support. For any specified resolution level $J$, we 
have 

$$g(x) = \sum_{|k| \le K_0} \alpha_k \phi_k(x) + \sum_{j=0}^{J} \sum_{|k| \le K_j} \beta_{jk} \psi_{j,k}(x) + \sum_{j=J+1}^{\infty} \sum_{|k| \le K_j} \beta_{jk} \psi_{j,k}(x)$$
$$= g_J(x) + R_J(x), \tag{10.21}$$

where 

$$g_J(x) = \sum_{|k| \le K_0} \alpha_k \phi_k(x) + \sum_{j=0}^{J} \sum_{|k| \le K_j} \beta_{jk} \psi_{j,k}(x), \quad \text{and}$$
$$R_J(x) = \sum_{j=J+1}^{\infty} \sum_{|k| \le K_j} \beta_{jk} \psi_{j,k}(x). \tag{10.22}$$

In the representation (10.22), we note that the $\phi$ functions appearing in the 
first part detect the global features of $g$, and subsequently the $\psi$ functions in 
the second part check for local details. 

To proceed further, many standard wavelet based procedures apply the 
‘discrete wavelet transform’ to the data and work with the resulting wavelet 
coefficients (see Vidakovic (1999), Müller and Vidakovic (1999)). We, how- 
ever, use the familiar hierarchical Bayesian approach to specify the prior model 
for g in (10.8). At the resolution level J, (10.8) can be expressed as 


$$y_i = g_J(x_i) + \eta_i + \epsilon_i, \tag{10.23}$$

where $\eta_i = R_J(x_i)$. Because the amount of information available in the likelihood 
function to estimate the infinitely many parameters $\beta_{jk}$, $j > J$, $|k| \le K_j$ 
(arising from the higher levels of resolution and appearing in $\eta_i$) is very limited, 
it is best to treat these $\eta_i$ as nuisance parameters and eliminate them 
by integrating out with respect to the prior given in (10.24) while estimating 
$g_J$. Otherwise, one will need to elicit some very informative prior on these parameters, 
thus attracting prior robustness issues as well. One other important 
issue is how large J should be. Note that the number of unknown parameters 
in the model grows exponentially with J, so it cannot be large for practical 
reasons. Also, there is no need for large J because its purpose is to check for 
local details only. 


10.2.2 Hierarchical Prior Structure and Posterior 
Computations 


In the first-stage prior specification, $\alpha_k$ and $\beta_{jk}$ are all assumed to be independent 
normal random variables with mean 0. A common prior variance of 
$\tau^2$ is assigned for $\alpha_k$, whereas to accommodate the decreasing effect of the 
‘detail’ coefficients $\beta_{jk}$, their variance is assumed to be $2^{-2js}\tau^2$. Now a joint 
prior distribution on $\sigma^2$ and $\tau^2$ completes the prior specification. Even though 
conditionally, given $\tau^2$, $\alpha_k$ and $\beta_{jk}$ are normally distributed, unconditionally 
they do have heavy tailed prior distributions possessing robustness properties. 
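As a small illustration of this first-stage specification, the sketch below lists the prior variances in units of $\tau^2$: ones for the scaling coefficients, followed by $2^{-2js}$ for every detail coefficient at level $j$. The values of $K_j$ and $s$ are hypothetical and chosen only to show the geometric decay across levels; the function name is illustrative.

```python
def relative_prior_variances(K, s):
    """Prior variances divided by tau^2.

    K is the list [K_0, K_1, ..., K_J]; there are 2*K_0 + 1 scaling coefficients
    and 2*K_j + 1 detail coefficients at level j (indices |k| <= K_j).
    """
    variances = [1.0] * (2 * K[0] + 1)               # scaling coefficients alpha_k
    for j, K_j in enumerate(K):                      # detail levels j = 0, ..., J
        variances.extend([2.0 ** (-2 * j * s)] * (2 * K_j + 1))
    return variances


K = [2, 3, 5, 9]   # hypothetical K_0, ..., K_3, so J = 3
s = 1.0            # hypothetical smoothness index
gamma_diag = relative_prior_variances(K, s)
n_detail = sum(2 * K_j + 1 for K_j in K)
```

With these inputs the finest level carries relative variance $2^{-6}$, so its coefficients are shrunk far more strongly toward zero than those at level 0.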

Let us now introduce some notation to facilitate the derivation of posterior 
quantities. Let $\gamma = (\alpha', \beta')'$, where $\alpha = (\alpha_k)_{|k| \le K_0}$, and $\beta = 
(\beta_{jk})_{0 \le j \le J, |k| \le K_j}$. Then the first stage prior specified above is 

$$\gamma \mid \tau^2 \sim N_{2K_0+1+M_J}(0, \tau^2 \Gamma), \quad \text{where } \Gamma = \begin{pmatrix} I_{2K_0+1} & 0 \\ 0 & \Delta \end{pmatrix},$$

with $M_J = \sum_{j=0}^{J}(2K_j + 1)$ and the diagonal matrix $\Delta$ being the variance-covariance 
matrix of $\beta$. Also, 

$$\eta = (\eta_1, \ldots, \eta_n)' \mid \tau^2 \sim N_n(0, \tau^2 Q_n), \tag{10.24}$$

where, to keep the covariance structure of $\eta_i$ simple, we choose 

$$(Q_n)_{il} = 2^{-2Js} \exp(-c|x_i - x_l|),$$

for some moderate value of $c$. Further, let $X = (\Phi', S')$ with the $i$th row of $\Phi'$ 
being $\{\phi_k(x_i)\}_{|k| \le K_0}$ and the $i$th row of $S'$ being $\{\psi_{jk}(x_i)\}_{0 \le j \le J, |k| \le K_j}$. Then, 
given $\gamma$, $\sigma^2$ and $\tau^2$, we have the following linear model for the observation 
vector $Y = (y_1, \ldots, y_n)'$: 

$$Y = X\gamma + u, \tag{10.25}$$

where $u = \eta + \epsilon \sim N_n(0, \Sigma)$ with $\Sigma = \sigma^2 I_n + \tau^2 Q_n$. This follows from the 
fact that 

$$Y \mid \gamma, \eta, \sigma^2, \tau^2 \sim N_n(X\gamma + \eta, \sigma^2 I_n), \tag{10.26}$$
$$\eta \mid \tau^2 \sim N_n(0, \tau^2 Q_n).$$

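From (10.25) and using standard hierarchical Bayes techniques (cf. Lindley 
and Smith (1972)) and matrix identities (cf. Searle (1982)), it follows that 

$$Y \mid \sigma^2, \tau^2 \sim N_n\left(0, \sigma^2 I_n + \tau^2 (X\Gamma X' + Q_n)\right), \tag{10.27}$$
$$\gamma \mid Y, \sigma^2, \tau^2 \sim N(AY, B), \tag{10.28}$$

where 

$$A = \tau^2 \Gamma X' \left(\sigma^2 I_n + \tau^2 (X\Gamma X' + Q_n)\right)^{-1},$$
$$B = \tau^2 \Gamma - \tau^4 \Gamma X' \left(\sigma^2 I_n + \tau^2 (X\Gamma X' + Q_n)\right)^{-1} X\Gamma.$$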
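The matrix identities behind $A$ and $B$ can be checked on a toy problem. The sketch below, using small hand-written linear algebra helpers, compares the forms given above with the equivalent "precision" forms $B = (\Gamma^{-1}/\tau^2 + X'\Sigma^{-1}X)^{-1}$ and $A = BX'\Sigma^{-1}$, where $\Sigma = \sigma^2 I_n + \tau^2 Q_n$. All inputs are hypothetical: three design points, the Haar $\phi$ and $\psi$ at level 0, and an exponentially decaying $Q_n$ with $J = 0$.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def lincomb(c1, A, c2, B):
    """Entrywise c1*A + c2*B."""
    return [[c1 * a + c2 * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def inverse(A):
    """Gauss-Jordan inverse with partial pivoting; adequate for small, well-conditioned matrices."""
    n = len(A)
    M = [list(row) + unit for row, unit in zip(A, identity(n))]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


# Toy inputs (all hypothetical).
x = [0.1, 0.5, 0.9]
X = [[1.0, 1.0], [1.0, -1.0], [1.0, -1.0]]       # columns: phi(x_i), psi(x_i) for Haar
Gamma = identity(2)                               # 2^{-2js} = 1 at level j = 0
sigma2, tau2, c_decay = 0.5, 2.0, 1.0
Q = [[math.exp(-c_decay * abs(xi - xl)) for xl in x] for xi in x]

# Forms of A and B as written in the text.
XGXt = matmul(matmul(X, Gamma), transpose(X))
V_inv = inverse(lincomb(sigma2, identity(3), tau2, lincomb(1.0, XGXt, 1.0, Q)))
GXt = matmul(Gamma, transpose(X))
A = [[tau2 * v for v in row] for row in matmul(GXt, V_inv)]
B = lincomb(tau2, Gamma, -tau2 ** 2, matmul(matmul(GXt, V_inv), matmul(X, Gamma)))

# Equivalent precision forms.
Sigma_inv = inverse(lincomb(sigma2, identity(3), tau2, Q))
XtSi = matmul(transpose(X), Sigma_inv)
B_alt = inverse(lincomb(1.0 / tau2, inverse(Gamma), 1.0, matmul(XtSi, X)))
A_alt = matmul(B_alt, XtSi)
```

The agreement of `A` with `A_alt` and of `B` with `B_alt` is the kind of identity referred to in the citation of Searle (1982); the form in the text requires inverting an $n \times n$ matrix, which the spectral decomposition handles once for all values of the hyperparameters.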
To proceed to the second-stage calculations, some algebraic simplifications are 
needed (see Angers and Delampady (1992)). Spectral decomposition yields 
$X\Gamma X' + Q_n = HDH'$, where $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ is the matrix of eigenvalues 
and $H$ is the orthogonal matrix of eigenvectors. Thus, 

$$\sigma^2 I_n + \tau^2 (X\Gamma X' + Q_n) = H(\sigma^2 I_n + \tau^2 D)H' = \tau^2 H(vI_n + D)H', \tag{10.29}$$

where $v = \sigma^2/\tau^2$. Using this spectral decomposition, the marginal density of 
$Y$ given $\tau^2$ and $v$ can be written as 

$$m(Y \mid \tau^2, v) = \frac{1}{(2\pi\tau^2)^{n/2}} \frac{1}{\det(vI_n + D)^{1/2}} \exp\left\{-\frac{1}{2\tau^2} Y'H(vI_n + D)^{-1}H'Y\right\}$$
$$= \frac{1}{(2\pi\tau^2)^{n/2} \prod_{i=1}^{n}(v + d_i)^{1/2}} \exp\left\{-\frac{1}{2\tau^2} \sum_{i=1}^{n} \frac{t_i^2}{v + d_i}\right\}, \tag{10.30}$$

where $t = (t_1, \ldots, t_n)' = H'Y$. 

To derive the wavelet smoother, all that we need to do now is to eliminate 
the hyper- and nuisance parameters from the first-stage posterior distribution 
by integrating out these variables with respect to the second-stage prior on 
them. This is what we will do now. Alternatively, one could employ an empirical 
Bayes approach and estimate $\sigma^2$ and $\tau^2$ from equation (10.27) and replace 
$\sigma^2$ and $\tau^2$ by their estimates in equation (10.28) to approximate $\gamma$. However, 
this will underestimate the variance of the wavelet estimator, $\hat Y = X\hat\gamma$. Suppose, 
then, $\pi_2(\tau^2, v)$ is the second stage prior. It is well known in the context 
of hierarchical Bayesian analysis (see Chapter 9, specially equation (9.7) and 
Berger, 1985a) that the sensitivity of the final Bayes estimator to the second and 
higher stage hyper-priors is somewhat limited. Therefore, for computational 
ease, we choose $\pi_2(\tau^2, v) = \pi_{22}(v)(\tau^2)^{-a}$ for some suitable choice of $a > 0$; 
$\pi_{22}$ is the prior specified for $v$. 

Once $a$ and $\pi_{22}$ are specified, using equation (10.28) along with (10.29) 
and taking the expectation with respect to $\tau^2$, we have that 

$$E(\gamma \mid Y) = \hat\gamma = \Gamma X' H\, E\left[(vI_n + D)^{-1} \mid Y\right] t, \tag{10.31}$$


where the expectation is taken with respect to $\pi_{22}(v \mid Y)$. Again using equations 
(10.28) and (10.29), the posterior covariance matrix of $\gamma$ can be written 
as 

$$\mathrm{Var}(\gamma \mid Y) = \frac{1}{n + 2a} E\left[\sum_{i=1}^{n} \frac{t_i^2}{v + d_i} \,\Big|\, Y\right] \Gamma$$
$$- \frac{1}{n + 2a} \Gamma X' H\, E\left[\left(\sum_{i=1}^{n} \frac{t_i^2}{v + d_i}\right)(vI_n + D)^{-1} \,\Big|\, Y\right] H'X\Gamma$$
$$+ E\left[\hat\gamma(v)\hat\gamma(v)' \mid Y\right], \tag{10.32}$$

where $\hat\gamma(v) = \Gamma X' H(vI_n + D)^{-1} t$. 

To compute these expectations, one can use several techniques. Because 
they involve only single dimensional integrals, standard numerical integration 
methods will work quite well. Several versions of the standard Monte Carlo 
approach can be employed quite satisfactorily and efficiently also. An example 
illustrating the methodology follows. 

Example 10.2. This is based on data provided by Prof. Abraham Verghese 
(F.R.E.S.) of the Indian Institute of Horticultural Research, Bangalore, India 
(personal communication), which have already been analyzed in Angers and 
Delampady (2001). The variable of interest y that we have chosen from the 
data set is the weekly average humidity level. The observations were made 
from June 1, 1995, to December 13, 1998. (For some reason, the observations 
were not recorded on the same day of the week every time.) We have chosen 
time (day of recording the observation) as the covariate $x$. (Any other available 
covariate can be used also because wavelet-based smoothing with respect to 
any arbitrary covariate (measured in some general way) can be handled with 
our methodology.) For illustration purposes, we have chosen the model with 
$J = 6$; the hyperparameter $a$ is 0.5 and the prior $\pi_{22}$ corresponds with an 
$F$ distribution with degrees of freedom 24 and 4. We have used compactly 
supported Daubechies wavelets for this analysis. As explained earlier, these 
cannot be expressed in closed form, but computations with these wavelets 
are possible using any of the several statistical and mathematical software 
packages. In Figure 10.1, we have plotted $\hat g_J$ (solid line) along with its error 
bands (dotted lines), $\pm 2\sqrt{\mathrm{Var}(y \mid Y)}$, where 

$$\mathrm{Var}(y \mid Y) = \mathrm{Var}(g_J(x) + \eta + \epsilon \mid Y).$$

Fig. 10.1. Wavelet smoother and its error bands for the Humidity data. 


More details on this example as well as other studies can be found in 
Angers and Delampady (2001). 
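The one-dimensional expectations over $v$ that appear in the posterior mean (10.31) can be approximated on a grid. The sketch below is not taken from Angers and Delampady (2001); it rests on a calculation made here: integrating $\tau^2$ out of (10.30) multiplied by $(\tau^2)^{-a}$ gives, up to a constant, $\pi_{22}(v \mid Y) \propto \pi_{22}(v) \prod_i (v + d_i)^{-1/2} S(v)^{-(n/2 + a - 1)}$ with $S(v) = \sum_i t_i^2/(v + d_i)$, which needs $n/2 + a > 1$. The eigenvalues `d`, the rotated data `t`, the truncation point `v_max`, and the grid size are all made-up illustrative choices; the $F(24, 4)$ prior and $a = 0.5$ follow Example 10.2.

```python
import math


def log_f_density(v, a, b):
    """Log density of the F(a, b) distribution at v > 0."""
    return (math.lgamma((a + b) / 2) - math.lgamma(a / 2) - math.lgamma(b / 2)
            + (a / 2) * math.log(a / b) + (a / 2 - 1) * math.log(v)
            - ((a + b) / 2) * math.log1p(a * v / b))


def log_post_v(v, t, d, a_hyper, df1=24, df2=4):
    """Unnormalised log posterior of v after integrating tau^2 out analytically."""
    n = len(t)
    S = sum(ti * ti / (v + di) for ti, di in zip(t, d))
    return (log_f_density(v, df1, df2)
            - 0.5 * sum(math.log(v + di) for di in d)
            - (n / 2 + a_hyper - 1) * math.log(S))


def shrinkage_weights(t, d, a_hyper=0.5, v_max=200.0, n_grid=20000):
    """Midpoint-rule estimate of E[1/(v + d_i) | Y] for each i, truncating v at v_max."""
    h = v_max / n_grid
    grid = [(i + 0.5) * h for i in range(n_grid)]
    logs = [log_post_v(v, t, d, a_hyper) for v in grid]
    top = max(logs)                                   # subtract the max for numerical stability
    w = [math.exp(value - top) for value in logs]
    total = sum(w)
    return [sum(wi / (v + di) for wi, v in zip(w, grid)) / total for di in d]


d = [4.0, 2.0, 1.0, 0.5, 0.1]        # hypothetical eigenvalues of X Gamma X' + Q_n
t = [3.0, -2.0, 1.0, 0.3, -0.2]      # hypothetical rotated data t = H'Y
weights = shrinkage_weights(t, d)
```

The resulting `weights` are the diagonal of $E[(vI_n + D)^{-1} \mid Y]$; multiplying them elementwise into `t` and then applying $\Gamma X' H$ would give the posterior mean in (10.31). Truncating a heavy-tailed prior at `v_max` is a simplification that a careful implementation would need to assess.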


10.3 Estimation of Regression Function Using Dirichlet 
Multinomial Allocation 


In Section 10.2, wavelets are used to represent the nonparametric regression 
function in (10.8) and a prior is put on the wavelet coefficients. Here we 
present an alternative approach based on the observation that the unknown 
regression function is locally linear and hence one may use a high-dimensional 
parametric family for modeling locally linear regression. Suppose we have 
a regression problem with a response variable $Y$ and a regressor variable 
$X$. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be independent paired observations on $(X, Y)$. 
Consider first the usual normal linear regression model where given values of 
the regressor variables $x_i$’s, the $Y_i$’s are independently normally distributed 
with common variance $\sigma_\epsilon^2$ and mean $E(Y_i \mid x_i) = \beta_1 + \beta_2 x_i$, a linear function 
of $x_i$. 

Let $Z_i = (X_i, Y_i)$ be independent, $Z_i$ having the density 

$$f(z_i \mid \phi_i) = f(x, y \mid \phi_i) = f_X(x \mid \mu_i, \sigma_i^2)\, f_Y(y \mid x, \beta_{1i}, \beta_{2i}, \sigma_\epsilon^2),$$

where $f_X(x \mid \mu_i, \sigma_i^2)$ and $f_Y(y \mid x, \beta_{1i}, \beta_{2i}, \sigma_\epsilon^2)$ denote respectively the $N(\mu_i, \sigma_i^2)$ 
density for $X_i$ and the $N(\beta_{1i} + \beta_{2i} x, \sigma_\epsilon^2)$ density for $Y_i$ given $x$, $\phi_i = (\mu_i, \sigma_i^2, \beta_{1i}, \beta_{2i})$, 
$i = 1, \ldots, n$. 

For simplicity we assume $\sigma_\epsilon^2$ is known, say, equal to 1. 

For the remaining parameters $\phi_i$, $i = 1, \ldots, n$, we have the Dirichlet multinomial 
allocation (DMA) prior, defined as follows. 

(1) Let $k \sim p(k)$, a distribution on $\{1, 2, \ldots, n\}$. 

(2) Given $k$, $\phi_i$, $i = 1, \ldots, n$ have at most $k$ distinct values $\theta_1, \ldots, \theta_k$, 
where $\theta_i$’s are i.i.d. $\sim G_0$ and $G_0$ is a distribution on the space of $(\mu, \sigma^2, \beta_1, \beta_2)$ 
(our choice of $G_0$ is mentioned below). 

(3) Given $k$, the vector of weights $(w_1, \ldots, w_k) \sim$ Dirichlet$(\delta_1, \ldots, \delta_k)$. 

(4) Allocation variables $a_1, \ldots, a_n$ are independent with 

$$P(a_i = j) = w_j, \quad j = 1, \ldots, k.$$

(5) Finally, $\phi_i = \theta_{a_i}$, $i = 1, \ldots, n$. 
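Steps (2) to (5) translate directly into a sampler for the prior when $k$ is fixed. The sketch below is illustrative only: the particular $G_0$ (and every hyperparameter in `draw_theta`), the Dirichlet parameters, and the function names are assumptions, and allocation labels run from 0 to $k - 1$ rather than 1 to $k$. The Dirichlet draw uses the standard device of normalising independent gamma variates.

```python
import random


def sample_dma_prior(n, k, delta, draw_theta, rng):
    """Draw (phi_1, ..., phi_n) from the DMA prior with k fixed."""
    thetas = [draw_theta(rng) for _ in range(k)]          # step (2): theta_j i.i.d. from G_0
    g = [rng.gammavariate(dj, 1.0) for dj in delta]       # step (3): Dirichlet via gammas
    total = sum(g)
    weights = [gj / total for gj in g]
    alloc = rng.choices(range(k), weights=weights, k=n)   # step (4): P(a_i = j) = w_j
    phis = [thetas[a] for a in alloc]                     # step (5): phi_i = theta_{a_i}
    return thetas, weights, alloc, phis


def draw_theta(rng):
    """Hypothetical G_0: independent normal, inverse gamma, normal, normal components."""
    mu = rng.gauss(0.0, 2.0)
    sigma2 = 1.0 / rng.gammavariate(3.0, 1.0)   # reciprocal of a gamma draw
    beta1 = rng.gauss(0.0, 3.0)
    beta2 = rng.gauss(0.0, 3.0)
    return (mu, sigma2, beta1, beta2)


rng = random.Random(0)
thetas, weights, alloc, phis = sample_dma_prior(
    n=20, k=5, delta=[1.0] * 5, draw_theta=draw_theta, rng=rng
)
```

Because several observations typically share the same allocation, the $\phi_i$ fall into clusters with identical values, which is what allows strength to be borrowed across observations.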

For simplicity, we illustrate with a known k (which will be taken ap- 
propriately large). We refer to Richardson and Green (1997) for the treat- 
ment of the case with unknown k; see also the discussion of this paper 
by Gruet and Robert, and Green and Richardson (2001). Under this prior 
$\phi_i = (\mu_i, \sigma_i^2, \beta_{1i}, \beta_{2i})$, $i = 1, \ldots, n$ are exchangeable. This allows borrowing 
of strength, as in Chapter 9, from clusters of $(x_i, y_i)$’s with similar values. To 
see how this works, one has to calculate the Bayes estimate through MCMC. 

We take $G_0$ to be the product of a normal distribution for $\mu$, an inverse 
Gamma distribution for $\sigma^2$ and normal distributions for $\beta_1$ and $\beta_2$. The full 
conditionals needed for sampling from the posterior using the Gibbs sampler can 
be easily obtained; see Robert and Casella (1999) in this context. For example, 
the conditional posterior distribution of $a_1, \ldots, a_n$ given the other parameters is 
as follows: 

$$a_i = j \text{ with probability } \frac{w_j f(Z_i \mid \theta_j)}{\sum_{r=1}^{k} w_r f(Z_i \mid \theta_r)},$$

$j = 1, \ldots, k$, $i = 1, \ldots, n$, and $a_1, \ldots, a_n$ are independent. 

Due to conjugacy, the other full conditional distributions can be easily 
obtained. You are invited to calculate the conditional posteriors in Problem 4. 
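The allocation step of the Gibbs sampler can be sketched as follows. Only this one full conditional is shown; the remaining ones are left to the exercise. The component density is the product of a normal density for $x$ and a normal density for $y$ given $x$, with the error variance of $Y$ taken as known and equal to 1 as assumed earlier; the parameter values, the data point, and the function names are hypothetical, and labels run from 0 to $k - 1$.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def component_density(z, theta, error_var=1.0):
    """f(z | theta) = N(x | mu, sigma2) * N(y | beta1 + beta2 * x, error_var)."""
    x, y = z
    mu, sigma2, beta1, beta2 = theta
    return normal_pdf(x, mu, sigma2) * normal_pdf(y, beta1 + beta2 * x, error_var)


def allocation_probabilities(z, thetas, weights):
    """P(a_i = j | rest), proportional to w_j * f(Z_i | theta_j)."""
    unnorm = [w * component_density(z, th) for w, th in zip(weights, thetas)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def sample_allocations(data, thetas, weights, rng):
    """One Gibbs update of all allocation variables, which are conditionally independent."""
    labels = range(len(thetas))
    return [rng.choices(labels, weights=allocation_probabilities(z, thetas, weights))[0]
            for z in data]


thetas = [(-1.0, 0.5, 0.0, 1.0), (1.0, 0.5, 2.0, -1.0)]   # hypothetical (mu, sigma2, beta1, beta2)
weights = [0.5, 0.5]
probs = allocation_probabilities((-1.0, -1.0), thetas, weights)
new_alloc = sample_allocations([(-1.0, -1.0), (1.0, 1.0)], thetas, weights, random.Random(1))
```

For the data point $(-1, -1)$ the first component fits both the location of $x$ and the local line far better, so nearly all of the allocation probability goes to it. For data far from every component the unnormalised terms can underflow, in which case working with logarithms would be safer.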

Note that given $k$, $\theta_1, \ldots, \theta_k$ and $w_1, \ldots, w_k$, we have a mixture with $k$ 
components. Each mixture models a locally linear regression. Because $\theta_i$ and 
$w_i$ are random, we have a rich family of locally linear regression models from 
which the posterior chooses different members and assigns to each member 
model a weight equal to its posterior probability density. The weight is a 
measure of how close this member model is to the data. The Bayes estimate of 
the regression function is a weighted average of the (conditional) expectations 
of locally linear regressions. 
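For fixed parameter values, the regression function implied by the mixture is $E(Y \mid x) = \sum_j w_j f_X(x \mid \mu_j, \sigma_j^2)(\beta_{1j} + \beta_{2j}x) / \sum_j w_j f_X(x \mid \mu_j, \sigma_j^2)$, and averaging this over posterior draws gives one natural estimate of the regression function. The sketch below illustrates that calculation; it is not necessarily the exact estimator used for the figures in this section, and the two "draws" are hypothetical numbers standing in for MCMC output.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_regression(x, thetas, weights):
    """E(Y | x) under a mixture whose components are theta_j = (mu, sigma2, beta1, beta2)."""
    num = 0.0
    den = 0.0
    for w, (mu, sigma2, beta1, beta2) in zip(weights, thetas):
        local_weight = w * normal_pdf(x, mu, sigma2)   # how strongly component j claims x
        num += local_weight * (beta1 + beta2 * x)      # its local line evaluated at x
        den += local_weight
    return num / den


def posterior_mean_regression(x, draws):
    """Average of E(Y | x) over draws, each draw being a (thetas, weights) pair."""
    return sum(mixture_regression(x, th, w) for th, w in draws) / len(draws)


# Hypothetical parameter values standing in for two posterior draws.
draw_a = ([(-1.0, 0.5, 0.0, 1.0), (1.0, 0.5, 2.0, -1.0)], [0.5, 0.5])
draw_b = ([(-1.0, 0.5, 0.2, 1.0), (1.0, 0.5, 1.8, -1.0)], [0.4, 0.6])
estimate_at_zero = posterior_mean_regression(0.0, [draw_a, draw_b])
```

Near the centre $\mu_j$ of a component the estimate follows that component's line, and between centres it interpolates smoothly, which is how a curve emerges from locally linear pieces.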

We illustrate the use of this method with a set of data simulated from a 
model for which 

$$E(Y \mid x) = \sin(2x), \quad \text{i.e., } Y = \sin(2x) + \epsilon.$$

We generate 100 pairs of observations $(X_i, Y_i)$ with normal errors $\epsilon_i$. A 
scatter plot of the data points and a plot of the estimated regression at each 
$X_i$ (using the Bayes estimates of $\beta_{1i}$, $\beta_{2i}$) together with the graph of $\sin(2x)$ 
are presented in Figure 10.2. In our calculation, we have chosen hyperparameters 
of the priors suitably to have priors with small information. Seo (2004) 
discusses the choice of hyperpriors and hyperparameters in examples of this 
kind. 

Fig. 10.2. Scatter plot, estimated regression, and true regression function. 

Following Müller et al. (1996), Seo (2004) also uses a Dirichlet process 
prior instead of the DMA. The Dirichlet process prior is beyond the scope of 
our book. See Ghosh and Ramamoorthi (2003, Chapter 3) for details. 

It is worth noting that the method developed works equally well if X is 
non-stochastic (as in Section 10.2) or has a known distribution. The trick is 
to ignore these facts and pretend that X is also random as above. See Müller 
et al. (1996) for further discussion of this point. 


10.4 Exercises 


1. Verify that Haar wavelets generate an MRA of $L^2(R)$. 

2. Indicate how Bayes factors can be used to obtain the optimal resolution 
level J in (10.21). 

3. Derive an appropriate wavelet smoother for the data given in Table 5.1 
and compare the results with those obtained using linear regression in 
Section 5.4. 

4. For the problem in Section 10.3, explain how MCMC can be implemented, 
deriving explicitly all the full conditionals needed. 

5. Choose any of the high-dimensional problems in Chapters 9 or 10 and 
suggest how hyperparameters may be chosen there. Discuss whether your 
findings will apply to all the higher levels of hierarchy. 


A 


Common Statistical Densities 


For quick reference, listed below are some common statistical densities that are 
used in examples and exercise problems in the book. Only a brief description, 
including the name of the density, the notation (abbreviation) used in the 
book, the density itself, the range of the variable argument, the parameter 
values, and some useful moments, is supplied. 


A.1 Continuous Models 


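1. Univariate normal ($N(\mu, \sigma^2)$): 

$$f(x \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left(-(x - \mu)^2/(2\sigma^2)\right),$$

$-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma^2 > 0$. 
Mean = $\mu$, variance = $\sigma^2$. 
Special case: $N(0, 1)$ is known as standard normal. 

2. Multivariate normal ($N_p(\mu, \Sigma)$): 

$$f(x \mid \mu, \Sigma) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left(-(x - \mu)'\Sigma^{-1}(x - \mu)/2\right),$$

$x \in R^p$, $\mu \in R^p$, $\Sigma_{p \times p}$ positive definite. 
Mean vector = $\mu$, covariance or dispersion matrix = $\Sigma$. 

3. Exponential ($\mathrm{Exp}(\lambda)$): 

$$f(x \mid \lambda) = \lambda \exp(-\lambda x), \quad x > 0, \ \lambda > 0.$$

Mean = $1/\lambda$, variance = $1/\lambda^2$. 

4. Double exponential or Laplace ($\mathrm{DExp}(\mu, \sigma)$): 

$$f(x \mid \mu, \sigma) = \frac{1}{2\sigma} \exp\left(-\frac{|x - \mu|}{\sigma}\right),$$

$-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma > 0$. 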
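Mean = $\mu$, variance = $2\sigma^2$. 

5. Gamma ($\mathrm{Gamma}(\alpha, \lambda)$): 

$$f(x \mid \alpha, \lambda) = \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} \exp(-\lambda x), \quad x > 0, \ \alpha > 0, \ \lambda > 0.$$

Mean = $\alpha/\lambda$, variance = $\alpha/\lambda^2$. 
Special cases: 
(i) $\mathrm{Exp}(\lambda)$ is $\mathrm{Gamma}(1, \lambda)$. 
(ii) Chi-square with $n$ degrees of freedom ($\chi^2_n$) is $\mathrm{Gamma}(n/2, 1/2)$. 

6. Uniform ($U(a, b)$): 

$$f(x \mid a, b) = \frac{1}{b - a} I_{(a, b)}(x), \quad -\infty < a < b < \infty.$$

Mean = $(a + b)/2$, variance = $(b - a)^2/12$. 

7. Beta ($\mathrm{Beta}(\alpha, \beta)$): 

$$f(x \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1}(1 - x)^{\beta - 1} I_{(0,1)}(x), \quad \alpha > 0, \ \beta > 0.$$

Mean = $\alpha/(\alpha + \beta)$, variance = $\alpha\beta/\{(\alpha + \beta)^2(\alpha + \beta + 1)\}$. 
Special case: $U(0, 1)$ is $\mathrm{Beta}(1, 1)$. 

8. Cauchy ($\mathrm{Cauchy}(\mu, \sigma^2)$): 

$$f(x \mid \mu, \sigma^2) = \frac{1}{\pi\sigma} \left(1 + \frac{(x - \mu)^2}{\sigma^2}\right)^{-1},$$

$-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma^2 > 0$. Mean and variance do not exist. 

9. $t$ distribution ($t(\alpha, \mu, \sigma^2)$): 

$$f(x \mid \alpha, \mu, \sigma^2) = \frac{\Gamma((\alpha + 1)/2)}{\sigma\sqrt{\alpha\pi}\,\Gamma(\alpha/2)} \left(1 + \frac{(x - \mu)^2}{\alpha\sigma^2}\right)^{-(\alpha + 1)/2},$$

$-\infty < x < \infty$, $\alpha > 0$, $-\infty < \mu < \infty$, $\sigma^2 > 0$. 
Mean = $\mu$ if $\alpha > 1$, variance = $\alpha\sigma^2/(\alpha - 2)$ if $\alpha > 2$. 
Special cases: 
(i) $\mathrm{Cauchy}(\mu, \sigma^2)$ is $t(1, \mu, \sigma^2)$. 
(ii) $t(k, 0, 1) = t_k$ is known as Student’s $t$ with $k$ degrees of freedom. 

10. Multivariate $t$ ($t_p(\alpha, \mu, \Sigma)$): 

$$f(x \mid \alpha, \mu, \Sigma) = \frac{\Gamma((\alpha + p)/2)}{(\alpha\pi)^{p/2}\,\Gamma(\alpha/2)} |\Sigma|^{-1/2} \left(1 + \frac{1}{\alpha}(x - \mu)'\Sigma^{-1}(x - \mu)\right)^{-(\alpha + p)/2},$$

$x \in R^p$, $\alpha > 0$, $\mu \in R^p$, $\Sigma_{p \times p}$ positive definite. 
Mean vector = $\mu$ if $\alpha > 1$, covariance or dispersion matrix = 
$\alpha\Sigma/(\alpha - 2)$ if $\alpha > 2$. 
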
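11. $F$ distribution with degrees of freedom $\alpha$ and $\beta$ ($F(\alpha, \beta)$): 

$$f(x \mid \alpha, \beta) = \frac{\Gamma((\alpha + \beta)/2)}{\Gamma(\alpha/2)\Gamma(\beta/2)} \left(\frac{\alpha}{\beta}\right)^{\alpha/2} \frac{x^{\alpha/2 - 1}}{\left(1 + \frac{\alpha}{\beta}x\right)^{(\alpha + \beta)/2}}, \quad x > 0, \ \alpha > 0, \ \beta > 0.$$

Mean = $\beta/(\beta - 2)$ if $\beta > 2$, variance = $2\beta^2(\alpha + \beta - 2)/\{\alpha(\beta - 4)(\beta - 2)^2\}$ 
if $\beta > 4$. 
Special cases: 
(i) If $X \sim t(\alpha, \mu, \sigma^2)$, $(X - \mu)^2/\sigma^2 \sim F(1, \alpha)$. 
(ii) If $X \sim t_p(\alpha, \mu, \Sigma)$, $\frac{1}{p}(X - \mu)'\Sigma^{-1}(X - \mu) \sim F(p, \alpha)$. 

12. Inverse Gamma (inverse $\mathrm{Gamma}(\alpha, \lambda)$): 

$$f(x \mid \alpha, \lambda) = \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{-(\alpha + 1)} \exp(-\lambda/x), \quad x > 0, \ \alpha > 0, \ \lambda > 0.$$

Mean = $\lambda/(\alpha - 1)$ if $\alpha > 1$, variance = $\lambda^2/\{(\alpha - 1)^2(\alpha - 2)\}$ if $\alpha > 2$. 
If $X \sim$ inverse $\mathrm{Gamma}(\alpha, \lambda)$, $1/X \sim \mathrm{Gamma}(\alpha, \lambda)$. 

13. Dirichlet (finite dimensional) ($D(\alpha)$): 

$$f(x \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1},$$

$x = (x_1, \ldots, x_k)'$ with $0 \le x_i \le 1$ for $1 \le i \le k$, $\sum_{i=1}^{k} x_i = 1$ and 
$\alpha = (\alpha_1, \ldots, \alpha_k)'$ with $\alpha_i > 0$ for $1 \le i \le k$. 
Mean vector = $\alpha/(\sum_{i=1}^{k}\alpha_i)$, covariance or dispersion matrix = $C_{k \times k}$ 
where 

$$C_{ij} = \begin{cases} \dfrac{\alpha_i \sum_{l \ne i}\alpha_l}{\left(\sum_{l=1}^{k}\alpha_l\right)^2\left(\sum_{l=1}^{k}\alpha_l + 1\right)} & \text{if } i = j; \\[2ex] \dfrac{-\alpha_i\alpha_j}{\left(\sum_{l=1}^{k}\alpha_l\right)^2\left(\sum_{l=1}^{k}\alpha_l + 1\right)} & \text{if } i \ne j. \end{cases}$$

14. Wishart ($W_p(n, \Sigma)$): 

$$f(A \mid \Sigma) = \frac{1}{2^{np/2}\,\Gamma_p(n/2)\,|\Sigma|^{n/2}} \exp\left(-\mathrm{trace}\{\Sigma^{-1}A\}/2\right) |A|^{(n - p - 1)/2},$$

$A_{p \times p}$ positive definite, $\Sigma_{p \times p}$ positive definite, $n \ge p$, $p$ positive integer, 

$$\Gamma_p(a) = \int_{A \text{ positive definite}} \exp\left(-\mathrm{trace}\{A\}\right) |A|^{a - (p + 1)/2}\, dA,$$

for $a > (p - 1)/2$. 
Mean = $n\Sigma$. For other moments, see Muirhead (1982). 
Special case: $\chi^2_n$ is $W_1(n, 1)$. 
If $W^{-1} \sim W_p(n, \Sigma)$ then $W$ is said to follow inverse-Wishart distribution. 

15. Logistic ($\mathrm{Logistic}(\mu, \sigma)$): 

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma} \frac{\exp\left(-\frac{x - \mu}{\sigma}\right)}{\left(1 + \exp\left(-\frac{x - \mu}{\sigma}\right)\right)^2},$$

$-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma > 0$. 
Mean = $\mu$, variance = $\pi^2\sigma^2/3$. 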
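Some of the special cases and moments listed above are easy to confirm numerically. The sketch below codes the Gamma, exponential, $t$, and Cauchy densities in the parametrisation used in this appendix (rate $\lambda$ for the Gamma, scale $\sigma$ for the $t$), so that the stated identities can be checked pointwise and a mean can be checked by a simple midpoint rule. The function names, the test parameters, and the integration range are illustrative choices.

```python
import math


def gamma_pdf(x, alpha, lam):
    """Gamma(alpha, lam) density with rate lam."""
    return lam ** alpha / math.gamma(alpha) * x ** (alpha - 1) * math.exp(-lam * x)


def exponential_pdf(x, lam):
    """Exp(lam) density."""
    return lam * math.exp(-lam * x)


def t_pdf(x, alpha, mu, sigma2):
    """t(alpha, mu, sigma^2) density."""
    sigma = math.sqrt(sigma2)
    const = math.gamma((alpha + 1) / 2) / (
        sigma * math.sqrt(alpha * math.pi) * math.gamma(alpha / 2)
    )
    return const * (1 + (x - mu) ** 2 / (alpha * sigma2)) ** (-(alpha + 1) / 2)


def cauchy_pdf(x, mu, sigma2):
    """Cauchy(mu, sigma^2) density."""
    return 1.0 / (math.pi * math.sqrt(sigma2) * (1 + (x - mu) ** 2 / sigma2))


def midpoint_integral(f, lo, hi, n=200000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h


# Mean of Gamma(3, 2), which the table gives as alpha / lam = 1.5.
gamma_mean = midpoint_integral(lambda x: x * gamma_pdf(x, 3.0, 2.0), 0.0, 40.0)
```

Evaluating `t_pdf(x, 1, mu, sigma2)` against `cauchy_pdf(x, mu, sigma2)`, or `gamma_pdf(x, 1, lam)` against `exponential_pdf(x, lam)`, at a few points confirms the corresponding special cases.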


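A.2 Discrete Models 

1. Binomial ($B(n, p)$): 

$$f(x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x},$$

$x = 0, 1, \ldots, n$, $0 \le p \le 1$, $n \ge 1$ integer. 
Mean = $np$, variance = $np(1 - p)$. 
Special case: Bernoulli($p$) is $B(1, p)$. 

2. Poisson ($P(\lambda)$): 

$$f(x \mid \lambda) = \frac{\exp(-\lambda)\lambda^x}{x!},$$

$x = 0, 1, \ldots$, $\lambda > 0$. 
Mean = $\lambda$, variance = $\lambda$. 

3. Geometric (Geometric($p$)): 

$$f(x \mid p) = (1 - p)^x p,$$

$x = 0, 1, \ldots$, $0 < p \le 1$. 
Mean = $(1 - p)/p$, variance = $(1 - p)/p^2$. 

4. Negative binomial (Negative binomial($k, p$)): 

$$f(x \mid k, p) = \binom{x + k - 1}{x} (1 - p)^x p^k,$$

$x = 0, 1, \ldots$, $0 < p \le 1$, $k \ge 1$ integer. 
Mean = $k(1 - p)/p$, variance = $k(1 - p)/p^2$. 
Special case: Geometric($p$) is Negative binomial($1, p$). 

5. Multinomial (Multinomial($n, p$)): 

$$f(x \mid n, p) = \frac{n!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k} p_i^{x_i},$$

$x = (x_1, \ldots, x_k)'$ with $x_i$ an integer between 0 and $n$, for $1 \le i \le k$, 
$\sum_{i=1}^{k} x_i = n$ and $p = (p_1, \ldots, p_k)'$ with $0 \le p_i \le 1$ for $1 \le i \le k$, 
$\sum_{i=1}^{k} p_i = 1$. 
Mean vector = $np$, covariance or dispersion matrix = $C_{k \times k}$ where 

$$C_{ij} = \begin{cases} np_i(1 - p_i) & \text{if } i = j; \\ -np_i p_j & \text{if } i \ne j. \end{cases}$$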

B 


Birnbaum’s Theorem on Likelihood Principle 


The object of this appendix is to rewrite the usual proof of Birnbaum’s the- 
orem (e.g., as given in Basu (1988)) using only mathematical statements and 
carefully defining all symbols and the domain of discourse. 


Let $\theta \in \Theta$ be the parameter of interest. A statistical experiment $\mathcal{E}$ is 
performed to generate a sample $x$. An experiment $\mathcal{E}$ is given by the triplet 
$(\mathcal{X}, \mathcal{A}, p)$ where $\mathcal{X}$ is the sample space, $\mathcal{A}$ is the class of all subsets of $\mathcal{X}$, and 
$p = \{p(\cdot \mid \theta), \theta \in \Theta\}$ is a family of probability functions on $(\mathcal{X}, \mathcal{A})$, indexed by 
the parameter space $\Theta$. Below we consider experiments with a fixed parameter 
space $\Theta$. 


A (finite) mixture of experiments $\mathcal{E}_1, \ldots, \mathcal{E}_k$ with mixture probabilities 
$\pi_1, \ldots, \pi_k$ (non-negative numbers free of $\theta$, summing to unity), which may 
be written as $\sum_{i=1}^{k} \pi_i \mathcal{E}_i$, is defined as a two stage experiment where one first 
selects $\mathcal{E}_i$ with probability $\pi_i$ and then observes $x_i \in \mathcal{X}_i$ by performing the 
experiment $\mathcal{E}_i$. 


Consider now a class of experiments closed under the formation of (finite) 
mixtures. Let $\mathcal{E} = (\mathcal{X}, \mathcal{A}, p)$ and $\mathcal{E}' = (\mathcal{X}', \mathcal{A}', p')$ be two experiments and 
$x \in \mathcal{X}$, $x' \in \mathcal{X}'$. By equivalence of the two points $(\mathcal{E}, x)$ and $(\mathcal{E}', x')$, we 
mean one makes the same inference on $\theta$ if one performs $\mathcal{E}$ and observes $x$ or 
performs $\mathcal{E}'$ and observes $x'$, and we denote this as 

$$(\mathcal{E}, x) \sim (\mathcal{E}', x').$$

We now consider the following principles. 


The likelihood principle (LP): We say that the equivalence relation “$\sim$” 
obeys the likelihood principle if $(\mathcal{E}, x) \sim (\mathcal{E}', x')$ whenever 

$$p(x \mid \theta) = c\,p'(x' \mid \theta) \text{ for all } \theta \in \Theta \tag{B.1}$$

for some constant $c > 0$. 


The weak conditionality principle (WCP): An equivalence relation “$\sim$” 
satisfies WCP if for a mixture of experiments $\mathcal{E} = \sum_{i=1}^{k} \pi_i \mathcal{E}_i$, 

$$(\mathcal{E}, (i, x_i)) \sim (\mathcal{E}_i, x_i)$$

for any $i \in \{1, \ldots, k\}$ and $x_i \in \mathcal{X}_i$. 


The sufficiency principle (SP): An equivalence relation “$\sim$” satisfies SP if 
$(\mathcal{E}, x) \sim (\mathcal{E}, x')$ whenever $S(x) = S(x')$ for some sufficient statistic $S$ for $\theta$ (or 
equivalently, $S(x) = S(x')$ for a minimal sufficient statistic $S$). 

It is shown in Basu and Ghosh (1967) (see also Basu (1969)) that for dis- 
crete models a minimal sufficient statistic exists and is given by the likelihood 
partition, i.e., the partition induced by the equivalence relation (B.1) for two 
points x, x’ from the same experiment. The difference between the likelihood 
principle and sufficiency principle is that in the former, x, x’ may belong to 
possibly different experiments while in the sufficiency principle they belong 
to the same experiment. 


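The weak sufficiency principle (WSP): An equivalence relation “$\sim$” satisfies 
WSP if $(\mathcal{E}, x) \sim (\mathcal{E}, x')$ whenever $p(x \mid \theta) = p(x' \mid \theta)$ for all $\theta$. 
It follows that SP implies WSP, which can be seen by noting that 

$$S(x) = \left\{ \frac{p(x \mid \theta)}{\sum_{\theta' \in \Theta} p(x \mid \theta')},\ \theta \in \Theta \right\}$$

is a (minimal) sufficient statistic. We assume without loss of generality that 

$$\sum_{\theta \in \Theta} p(x \mid \theta) > 0 \text{ for all } x \in \mathcal{X}.$$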
0EO 


We now state and prove Birnbaum’s theorem on likelihood principle (Birn- 
baum (1962)). 


Theorem B.1. WCP and WSP together imply LP, i.e., if an equivalence 
relation satisfies WCP and WSP then it also satisfies LP. 


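Proof. Suppose an equivalence relation “$\sim$” satisfies WCP and WSP. Consider 
two experiments $\mathcal{E}_1 = (\mathcal{X}_1, \mathcal{A}_1, p_1)$ and $\mathcal{E}_2 = (\mathcal{X}_2, \mathcal{A}_2, p_2)$ with the same $\Theta$ and 
samples $x_i \in \mathcal{X}_i$, $i = 1, 2$, such that 

$$p_1(x_1 \mid \theta) = c\,p_2(x_2 \mid \theta) \text{ for all } \theta \in \Theta \tag{B.2}$$

for some $c > 0$. 

We are to show that $(\mathcal{E}_1, x_1) \sim (\mathcal{E}_2, x_2)$. Consider the mixture experiment 
$\mathcal{E}$ of $\mathcal{E}_1$ and $\mathcal{E}_2$ with mixture probabilities $1/(1 + c)$ and $c/(1 + c)$ respectively, 
i.e., 

$$\mathcal{E} = \frac{1}{1 + c}\mathcal{E}_1 + \frac{c}{1 + c}\mathcal{E}_2.$$

The points $(1, x_1)$ and $(2, x_2)$ in the sample space of $\mathcal{E}$ have probabilities 
$p_1(x_1 \mid \theta)/(1 + c)$ and $p_2(x_2 \mid \theta)c/(1 + c)$, respectively, which are the same by 
(B.2). WSP then implies that 

$$(\mathcal{E}, (1, x_1)) \sim (\mathcal{E}, (2, x_2)). \tag{B.3}$$

Also, by WCP 

$$(\mathcal{E}, (1, x_1)) \sim (\mathcal{E}_1, x_1) \text{ and } (\mathcal{E}, (2, x_2)) \sim (\mathcal{E}_2, x_2). \tag{B.4}$$

From (B.3) and (B.4), we have $(\mathcal{E}_1, x_1) \sim (\mathcal{E}_2, x_2)$. $\Box$ 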
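The mixture construction in the proof can be traced on a concrete pair of experiments. The example below is not taken from this appendix; it uses the familiar pairing of a binomial experiment (12 trials, 3 successes observed) with a negative binomial experiment (sampling until the 3rd success, 9 failures observed), whose likelihoods are proportional with $c = 4$. The sketch checks that the two points of the mixture experiment then have identical probabilities for every $\theta$, which is the hypothesis WSP needs.

```python
from math import comb


def binomial_pmf(x, n, p):
    """B(n, p) probability of x successes."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def negative_binomial_pmf(x, k, p):
    """Negative binomial(k, p) probability of x failures before the k-th success."""
    return comb(x + k - 1, x) * (1 - p) ** x * p ** k


# p_1(x_1 | theta) = c * p_2(x_2 | theta) with c = C(12, 3) / C(11, 9) = 220 / 55.
c = comb(12, 3) / comb(11, 9)


def mixture_point_probabilities(theta):
    """Probabilities of the points (1, x_1) and (2, x_2) in the mixture used in the proof."""
    prob_1 = binomial_pmf(3, 12, theta) / (1 + c)
    prob_2 = c * negative_binomial_pmf(9, 3, theta) / (1 + c)
    return prob_1, prob_2
```

Since the two probabilities agree for all $\theta$, WSP makes the two mixture points equivalent, and WCP then carries the equivalence back to the component experiments.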


C 


Coherence 


Coherence was originally introduced by de Finetti to show that any quantification 
of uncertainty that does not satisfy the axioms of a (finitely additive) probability 
distribution would lead to sure loss in a suitably chosen gamble. This 
is formally stated in Theorem C.1 below. This section is based on Schervish 
(1995, pp. 654, 655) except that we use finite additivity instead of countable 
additivity. 


Definition 1. For a bounded random variable X, the fair price or prevision 
P(X) is a number p such that a gambler is willing to accept all gambles of the 
form c(X — p) for all c in some sufficiently small symmetric interval around 
0. Here c(X — p) represents the gain to the gambler. That the values of c are 
sufficiently small ensures all losses are within the means of the gambler to pay, 
at least for bounded X. 


Definition 2. Let $\{X_\alpha, \alpha \in A\}$ be a collection of bounded random variables. 
Suppose that for each $X_\alpha$, $P(X_\alpha)$ is the prevision of a gambler who is willing 
to accept all gambles of the form $c(X_\alpha - P(X_\alpha))$ for $-d_\alpha \le c \le d_\alpha$. 
These previsions are defined to be coherent if there do not exist a finite 
set $A_0 \subset A$ and $\{c_\alpha : -d \le c_\alpha \le d, \alpha \in A_0\}$, $d \le \min\{d_\alpha, \alpha \in A_0\}$, 
such that $\sum_{\alpha \in A_0} c_\alpha(X_\alpha - P(X_\alpha)) < 0$ for all values of the random variables. 
It is assumed that a gambler willing to accept each of a finite number 
of gambles $c_\alpha(X_\alpha - P(X_\alpha))$, $\alpha \in A_0$, is also willing to take the gamble 
$\sum_{\alpha \in A_0} c_\alpha(X_\alpha - P(X_\alpha))$, $c_\alpha$ sufficiently small, for finite sets $A_0$. If each 
$X_\alpha$ takes only a finite number of distinct values (as in Theorem C.1 below), 
then $\sum_{\alpha \in A_0} c_\alpha(X_\alpha - P(X_\alpha)) < 0$ for all values of $X_\alpha$’s if and only if 
$\sum_{\alpha \in A_0} c_\alpha(X_\alpha - P(X_\alpha)) < -\epsilon$ for all values of $X_\alpha$’s, for some $\epsilon > 0$. The 
second condition is what de Finetti requires. 


If the previsions of a gambler are not coherent (incoherent), then he can 


be forced to lose money always in a suitably chosen gamble. 


Theorem C.1. Let $(S, \mathcal{A})$ be a “measurable” space. Suppose that for each 
$A \in \mathcal{A}$, the prevision is $P(I_A)$, where $I_A$ denotes the indicator of $A$. Then the 
previsions are coherent if and only if the set function $\mu$, defined as $\mu(A) = 
P(I_A)$, is a finitely additive probability on $\mathcal{A}$. 


Proof. Suppose $\mu$ is a finitely additive probability on $(S, \mathcal{A})$. Let $\{A_1, \ldots, A_m\}$ 
be any finite collection of elements in $\mathcal{A}$ and suppose the gambler is ready to 
accept gambles of the form $c_i(I_{A_i} - P(I_{A_i}))$. Then 

$$Z = \sum_{i=1}^{m} c_i(I_{A_i} - P(I_{A_i}))$$

has $\mu$-expectation equal to 0, and therefore it is not possible that $Z$ is always 
less than 0. This implies incoherence cannot happen. 

Conversely, assume coherence. We show that $\mu$ is a finitely additive probability 
by showing any violation of the probability axioms leads to a non-zero 
non-random gamble that can be made negative. 

(i) $\mu(\emptyset) = 0$: Because $I(\emptyset) = 0$, $-c\mu(\emptyset) = c(I(\emptyset) - \mu(\emptyset)) \ge 0$ for some 
positive and negative values of $c$, implying $\mu(\emptyset) = 0$. Similarly $\mu(S) = 1$. 

(ii) $\mu(A) \ge 0 \ \forall A \in \mathcal{A}$: If $\mu(A) < 0$, then for any $c < 0$, $c(I_A - \mu(A)) \le 
-c\mu(A) < 0$. This means there is incoherence. 

(iii) $\mu$ is finitely additive: Let $A_1, \ldots, A_m$ be disjoint sets in $\mathcal{A}$ and 
$\cup_{i=1}^{m} A_i = A$. Let 

$$Z = \sum_{i=1}^{m} c(I(A_i) - \mu(A_i)) - c(I(A) - \mu(A)) = c\left(\mu(A) - \sum_{i=1}^{m} \mu(A_i)\right).$$

If $\mu(A) < \sum_{i=1}^{m} \mu(A_i)$, then $Z$ is always negative for any $c > 0$, whereas 
$\mu(A) > \sum_{i=1}^{m} \mu(A_i)$ implies $Z$ is always negative for any $c < 0$. Thus $\mu(A) \ne 
\sum_{i=1}^{m} \mu(A_i)$ leads to incoherence. $\Box$ 
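Step (iii) of the proof can be played out numerically. The sketch below takes hypothetical previsions for two disjoint events and for their union, forms the combined gamble $Z$ of the proof, and evaluates it at every possible outcome. The function name and the numbers are illustrative. Because the indicators cancel, $Z$ is the same constant whatever happens, and its sign is controlled by the sign of $c$.

```python
def gamble_payoffs(mu_parts, mu_union, c):
    """Gain Z of step (iii) for each possible outcome.

    Outcomes are indexed 0..m-1 (the sample point falls in A_i) and None
    (the sample point falls outside the union A).
    """
    m = len(mu_parts)
    payoffs = {}
    for outcome in list(range(m)) + [None]:
        ind_parts = [1.0 if outcome == i else 0.0 for i in range(m)]
        ind_union = 0.0 if outcome is None else 1.0
        z = sum(c * (ind - mu) for ind, mu in zip(ind_parts, mu_parts))
        z -= c * (ind_union - mu_union)
        payoffs[outcome] = z
    return payoffs


# Previsions that violate additivity: mu(A_1) + mu(A_2) = 0.7 but mu(A) = 0.6.
sure_loss = gamble_payoffs([0.3, 0.4], 0.6, c=1.0)

# Violation in the other direction, exploited with a negative stake.
sure_loss_other = gamble_payoffs([0.3, 0.4], 0.8, c=-1.0)
```

In both cases the gambler loses the same fixed amount under every outcome, which is the sure loss that coherence rules out.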


D 


Microarray 


Proteins are essential for sustaining life of a living organism. Every cell in 
an individual has the same information for production of a large number of 
proteins. This information is encoded in the DNA. The information is tran- 
scribed and translated by the cell machinery to produce proteins. Different 
proteins are produced by different segments of the DNA that are called genes. 
Although every cell has the same information for production of the same set 
of proteins, all cells do not produce all proteins. Within an individual, cells 
are organized into groups that are specialized to perform specific tasks. Such 
groups of specialized cells are called tissues. A number of tissues make up an 
organ, such as the pancreas. Two tissues may produce completely disjoint sets of 
proteins; or, may produce the same protein in different quantities. 

The molecule that transfers information from the genes for the production 
of proteins is called the messenger RNA (mRNA). Genes that produce a lot of 
mRNA are said to be upregulated and genes that produce little or no mRNA 
are said to be downregulated. For example, in certain cells of the pancreas, 
the gene that produces insulin will be upregulated (that is, large amounts of 
insulin mRNA will be produced), whereas it will be downregulated in the liver 
(because insulin is produced only by certain cells of the pancreas and by no 
other organ in the human body). In certain disease states, such as diabetes, 
there will be alteration in the amount of insulin mRNA. 

A microarray is a tool for measuring the amount of mRNA that is circulating
in a cell. Microarrays simultaneously measure the amount of circulating
mRNA corresponding to thousands of different genes. Among various applications,
such data are helpful in understanding the nature and extent of
involvement of different genes in various diseases, such as cancer and diabetes.

A generic microarray consists of multiple spots of DNA and is used to 
determine the quantities of mRNA in a collection of cells. The DNA in each 
spot is from a gene of interest and serves as a probe for the mRNA encoded by 
that gene. In general, one can think of a microarray as a grid (or a matrix) of 
several thousand DNA spots in a very small area (glass or polymer surface).
Each spot has a unique DNA sequence, different from the DNA sequence of
the other spots around it.

mRNA from a clump of cells (that is, all the mRNAs produced by the dif- 
ferent genes that are expressed in these cells) is extracted and experimentally 
converted (using a biochemical molecule called reverse transcriptase) to their 
complementary DNA strands (cDNA). A molecular tag that glows is attached 
to each piece of cDNA. This mixture is then “poured” over a microarray. Each 
DNA spot in the microarray will hybridize (that is, attach itself) only to its 
complementary DNA strand. The amount of fluorescence (usually measured 
using a laser beam) at a particular spot on the microarray gives an indication 
as to how much mRNA of a particular type was present in the original sample. 

There are many sources of variability in the observations from a microarray 
experiment. Aside from the intrinsic biological variability across individuals or 
across tissues within the same individual, among the more important sources 
of variability are (a) the method of mRNA extraction from the cells; (b) the nature of
the fluorescent tags used; (c) the temperature and time under which the experiment
(that is, hybridization) was performed; (d) the sensitivity of the laser detector in
relation to the chemistry of the fluorescent tags used; and (e) the sensitivity
and robustness of the image analysis system that is used to identify and
quantify the fluorescence at each spot in the microarray. All of these experimental
factors get in the way of comparing results across microarray experiments.
Even within one experiment, the brightness of two spots can differ even when
the same number of complementary DNA strands have hybridized to the spots,
necessitating normalization of each image using statistical methods.

The amount of mRNA is quantified by a fluorescent signal. Some spots 
on a microarray, after the chemical reaction, show high levels of fluorescence 
and some show low or no fluorescence. The genes that show a high level of
fluorescence are likely to be expressed, whereas the genes corresponding to
a low level of fluorescence are likely to be under-expressed or not expressed.
Even genes that are not expressed may show low levels of fluorescence, which
is treated as noise. The software package of the experimenter identifies
background noise and calculates its mean, which is subtracted from all the
measurements of fluorescence. The result is the final observation $X_i$ that we model, with
$\mu = 0$ indicating no expression, $\mu > 0$ indicating expression, and $\mu < 0$
indicating negative expression, i.e., under-expression. If the genes turn out in
further studies to regulate growth of tumors, the expressed genes might help 
growth while the under-expressed genes could inhibit it. 
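
The background correction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `background_corrected` and all the numbers are hypothetical, and real image-analysis software identifies the background spots itself rather than receiving them as a list.

```python
from statistics import mean


def background_corrected(spot_intensities, background_intensities):
    """Subtract the mean background level from every spot intensity."""
    background_level = mean(background_intensities)
    return [raw - background_level for raw in spot_intensities]


# Made-up fluorescence readings for illustration only (not real data).
raw_spots = [120.0, 95.0, 400.0, 101.0]
background = [98.0, 102.0, 100.0]

# x[i] plays the role of the final observation X_i in the text.
x = background_corrected(raw_spots, background)
for i, x_i in enumerate(x, start=1):
    print(f"X_{i} = {x_i:+.1f}")
```

Note that the sign of a single corrected reading is itself noisy; statements about expression or under-expression concern the parameter $\mu$ of the model for $X_i$, not the raw sign of one observation.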


E 


Bayes Sufficiency 


If T is a sufficient statistic, then, at least for the discrete case and the continuous
case with a p.d.f., an application of the factorization theorem implies that the posterior
distribution of $\theta$ given X is the same as the posterior given T. Thus in a Bayesian sense
also, all information about $\theta$ contained in X is carried by T. In many cases,
e.g., for the multivariate normal, the calculation of the posterior can be simplified by
an application of this fact.
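
The step from factorization to the posterior can be written out as follows for the case with densities. The symbols $g$ and $h$ for the two factors, and $\pi(\theta)$ for the prior density, are notation assumed here for the sketch.

```latex
f(x \mid \theta) = g(T(x), \theta)\, h(x)
\quad\Longrightarrow\quad
\pi(\theta \mid x)
  = \frac{g(T(x), \theta)\, h(x)\, \pi(\theta)}
         {\int g(T(x), \theta')\, h(x)\, \pi(\theta')\, d\theta'}
  = \frac{g(T(x), \theta)\, \pi(\theta)}
         {\int g(T(x), \theta')\, \pi(\theta')\, d\theta'} .
```

The factor $h(x)$ cancels, so the posterior depends on $x$ only through $T(x)$. Since the density of $T$, viewed as a function of $\theta$, is itself proportional to $g(t, \theta)$, the last expression is also the posterior given $T = T(x)$.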

More importantly, these considerations suggest an alternative definition of 
sufficiency appropriate in Bayesian analysis. 


Definition. A statistic T is sufficient in a Bayesian sense, if for all priors
$\pi(\theta)$, the posterior $\pi(\theta \mid X) = \pi(\theta \mid T(X))$.
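
The definition can be checked numerically in a simple case. The sketch below assumes i.i.d. Bernoulli observations with $T(X) = \sum X_i$, a three-point parameter grid, and one arbitrarily chosen prior; all names and numbers are hypothetical. It verifies the equality for that one prior only, whereas the definition requires it for every prior.

```python
from fractions import Fraction
from math import comb


def posterior(prior, likelihood):
    """Bayes' rule on a finite grid; both arguments are dicts keyed by theta."""
    joint = {theta: prior[theta] * likelihood[theta] for theta in prior}
    total = sum(joint.values())
    return {theta: weight / total for theta, weight in joint.items()}


def lik_full(x, theta):
    """Probability of the exact 0/1 sequence x under Bernoulli(theta)."""
    p = Fraction(1)
    for xi in x:
        p *= theta if xi == 1 else 1 - theta
    return p


def lik_sum(t, n, theta):
    """Binomial probability that the sum T of n observations equals t."""
    return comb(n, t) * theta**t * (1 - theta) ** (n - t)


# Hypothetical prior on a three-point grid (exact arithmetic via Fraction).
prior = {
    Fraction(1, 4): Fraction(1, 2),
    Fraction(1, 2): Fraction(1, 3),
    Fraction(3, 4): Fraction(1, 6),
}

x = [1, 0, 1, 1, 0]      # made-up data
t, n = sum(x), len(x)    # T(X) and the sample size

post_x = posterior(prior, {th: lik_full(x, th) for th in prior})
post_t = posterior(prior, {th: lik_sum(t, n, th) for th in prior})

print(post_x == post_t)  # the two posteriors agree exactly
```

The binomial coefficient in `lik_sum` does not involve $\theta$, so it cancels in the normalization; this is the same cancellation that the factorization theorem provides in general.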

Classical sufficiency always implies sufficiency in the Bayesian sense. It can 
be shown that if the family of probability measures in the model is dominated, 
i.e., the probability measures possess densities with respect to a $\sigma$-finite
measure, then the factorization theorem holds, vide Lehmann (1986). In this case,
it can be shown that the converse is true, i.e., T is sufficient in the classical 
sense if it is sufficient in the Bayesian sense. 

A famous counter-example due to Blackwell and Ramamoorthi (1982) 
shows this is not true in the undominated case even under nice set theoretic 
conditions.